This project analyses the WHO “Countries and Death Cause” dataset, which contains approximately 30 years of mortality and risk-factor data across multiple countries.
The workflow includes:

- Exploratory data analysis (EDA) to understand distributions, trends, and missingness
- Predictive modelling (e.g., Decision Tree, K-Nearest Neighbors) to classify countries into high vs. low death-rate groups using selected risk factors
- Clustering to identify groups of countries with similar mortality patterns
The project demonstrates an end-to-end analytics and modelling pipeline in R, with a focus on producing interpretable results that could support data-driven public health insights.
Load Packages
Load the necessary libraries for data manipulation, visualization, and machine learning models.
library(ggplot2)
library(pander)
library(knitr)
library(dplyr)
library(countrycode)
library(rpart)
library(rpart.plot)
library('ROCR')
library(tidyverse)
library(kableExtra)
library(skimr)
library(ROCit)
library(gridExtra)
library('class')
library(fpc)
library(PerformanceAnalytics)
library(corrplot)
library(RColorBrewer)
library(lime)
library(caret)
library('grDevices')
Visual setting
Apply color palettes from the RColorBrewer package to enhance visualization.
cols = brewer.pal(n = 8, name = "Dark2")
plot_theme <- function() {
theme(
plot.title = element_text(hjust = 0.5, face = "bold", margin = margin(b = 25)),
panel.border = element_rect(color = "black", fill = NA, size = 0.75),
panel.grid.major = element_line(color = "grey", linetype = "dashed", size = 0.2),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.margin = margin(t = 20, r = 20, b = 15, l = 20),
axis.ticks.length = unit(7, "pt"),
axis.text = element_text(size = 11),
axis.title = element_text(size = 12),
axis.title.x = element_text(margin = margin(t = 14)),
axis.title.y = element_text(margin = margin(r = 14))
)
}
Load the death data, population and GDP data.
data_dir <- "../data/raw"
df_death <- read.csv(file.path(data_dir, "Countries and death causes.csv"), header = TRUE)
df_pop <- read.csv(file.path(data_dir, "API_SP.POP.TOTL_DS2_en_csv_v2_31753.csv"), skip = 4, header = TRUE)
df_gdp <- read.csv(file.path(data_dir, "API_NY.GDP.PCAP.CD_DS2_en_csv_v2_31681.csv"), skip = 4, header = TRUE)
Death Data: The death data used in this project was collected from the World Health Organization (WHO), which provides yearly death counts across 228 entities from 1990 to 2019, including various causes of death (World Health Organization, n.d.) (Provided in Project Instruction).
Population Data: The population data was sourced from the World Bank, providing total population figures for each country (World Bank, n.d.) (Population Data).
GDP Data: GDP data was also obtained from the World Bank, representing GDP per capita (current US$) as an indicator of the economic status of each country (World Bank, n.d.) (GDP Data).
This section explores the structure and types of variables in the
df_death dataset, which is essential for guiding the
subsequent data handling steps.
Data Structure and Types
The first step is to observe the first few rows, examine the data structure and review the summary.
cat("Dataset size is", dim(df_death))
## Dataset size is 6840 31
head(df_death) %>% knitr::kable() %>% kable_styling(font_size = 12, bootstrap_options = "basic", full_width = FALSE) %>% scroll_box(height = "50%", width = "100%")
| Entity | Code | Year | Outdoor.air.pollution | High.systolic.blood.pressure | Diet.high.in.sodium | Diet.low.in.whole.grains | Alochol.use | Diet.low.in.fruits | Unsafe.water.source | Secondhand.smoke | Low.birth.weight | Child.wasting | Unsafe.sex | Diet.low.in.nuts.and.seeds | Household.air.pollution.from.solid.fuels | Diet.low.in.Vegetables | Low.physical.activity | Smoking | High.fasting.plasma.glucose | Air.pollution | High.body.mass.index | Unsafe.sanitation | No.access.to.handwashing.facility | Drug.use | Low.bone.mineral.density | Vitamin.A.deficiency | Child.stunting | Discontinued.breastfeeding | Non.exclusive.breastfeeding | Iron.deficiency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | AFG | 1990 | 3169 | 25633 | 1045 | 7077 | 356 | 3185 | 3702 | 4794 | 16135 | 19546 | 351 | 2319 | 34372 | 3679 | 2637 | 5174 | 11449 | 37231 | 9518 | 2798 | 4825 | 174 | 389 | 2016 | 7686 | 107 | 2216 | 564 |
| Afghanistan | AFG | 1991 | 3222 | 25872 | 1055 | 7149 | 364 | 3248 | 4309 | 4921 | 17924 | 20334 | 361 | 2449 | 35392 | 3732 | 2652 | 5247 | 11811 | 38315 | 9489 | 3254 | 5127 | 188 | 389 | 2056 | 7886 | 121 | 2501 | 611 |
| Afghanistan | AFG | 1992 | 3395 | 26309 | 1075 | 7297 | 376 | 3351 | 5356 | 5279 | 21200 | 22895 | 378 | 2603 | 38065 | 3827 | 2688 | 5363 | 12265 | 41172 | 9528 | 4042 | 5889 | 211 | 393 | 2100 | 8568 | 150 | 3053 | 700 |
| Afghanistan | AFG | 1993 | 3623 | 26961 | 1103 | 7499 | 389 | 3480 | 7152 | 5734 | 23795 | 27002 | 395 | 2771 | 41154 | 3951 | 2744 | 5522 | 12821 | 44488 | 9611 | 5392 | 7007 | 232 | 411 | 2316 | 9875 | 204 | 3726 | 773 |
| Afghanistan | AFG | 1994 | 3788 | 27658 | 1134 | 7698 | 399 | 3610 | 7192 | 6050 | 24866 | 29205 | 410 | 2932 | 43153 | 4075 | 2805 | 5689 | 13400 | 46634 | 9675 | 5418 | 7421 | 247 | 413 | 2665 | 11031 | 204 | 3833 | 812 |
| Afghanistan | AFG | 1995 | 3869 | 28090 | 1154 | 7807 | 406 | 3703 | 8378 | 6167 | 25534 | 30943 | 422 | 3049 | 44024 | 4153 | 2839 | 5801 | 13871 | 47566 | 9608 | 6313 | 7896 | 260 | 417 | 3070 | 11973 | 233 | 4124 | 848 |
# to understand the data types of each variable
# str(df_death)
summary(df_death) %>% knitr::kable() %>% kable_styling(font_size = 12, bootstrap_options = "basic", full_width = FALSE) %>% scroll_box(width = "100%")
| Entity | Code | Year | Outdoor.air.pollution | High.systolic.blood.pressure | Diet.high.in.sodium | Diet.low.in.whole.grains | Alochol.use | Diet.low.in.fruits | Unsafe.water.source | Secondhand.smoke | Low.birth.weight | Child.wasting | Unsafe.sex | Diet.low.in.nuts.and.seeds | Household.air.pollution.from.solid.fuels | Diet.low.in.Vegetables | Low.physical.activity | Smoking | High.fasting.plasma.glucose | Air.pollution | High.body.mass.index | Unsafe.sanitation | No.access.to.handwashing.facility | Drug.use | Low.bone.mineral.density | Vitamin.A.deficiency | Child.stunting | Discontinued.breastfeeding | Non.exclusive.breastfeeding | Iron.deficiency | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length:6840 | Length:6840 | Min. :1990 | Min. : 0.0 | Min. : 2 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 1 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0.0 | Min. : 0.0 | Min. : 1 | Min. : 3 | Min. : 0.0 | Min. : 2.0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.00 | Min. : 0.0 | Min. : 0 | |
| Class :character | Class :character | 1st Qu.:1997 | 1st Qu.: 433.8 | 1st Qu.: 1828 | 1st Qu.: 137.0 | 1st Qu.: 273.8 | 1st Qu.: 263.8 | 1st Qu.: 144.0 | 1st Qu.: 7.0 | 1st Qu.: 209 | 1st Qu.: 123 | 1st Qu.: 26 | 1st Qu.: 97 | 1st Qu.: 27 | 1st Qu.: 32 | 1st Qu.: 109.0 | 1st Qu.: 92.0 | 1st Qu.: 894 | 1st Qu.: 1178 | 1st Qu.: 815.8 | 1st Qu.: 918.5 | 1st Qu.: 3 | 1st Qu.: 19 | 1st Qu.: 31 | 1st Qu.: 43 | 1st Qu.: 0.0 | 1st Qu.: 1.0 | 1st Qu.: 0.00 | 1st Qu.: 3.0 | 1st Qu.: 1 | |
| Mode :character | Mode :character | Median :2004 | Median : 2101.0 | Median : 8770 | Median : 969.5 | Median : 1444.0 | Median : 1780.5 | Median : 834.5 | Median : 182.5 | Median : 994 | Median : 1057 | Median : 504 | Median : 619 | Median : 252 | Median : 821 | Median : 590.5 | Median : 521.5 | Median : 4987 | Median : 4966 | Median : 5748.0 | Median : 3917.0 | Median : 102 | Median : 221 | Median : 222 | Median : 277 | Median : 2.0 | Median : 41.5 | Median : 4.00 | Median : 60.5 | Median : 12 | |
| NA | NA | Mean :2004 | Mean : 84582.4 | Mean : 224225 | Mean : 40497.2 | Mean : 38691.3 | Mean : 54848.6 | Mean : 23957.8 | Mean : 44086.4 | Mean : 30364 | Mean : 59126 | Mean : 49924 | Mean : 27646 | Mean : 12996 | Mean : 83642 | Mean : 11982.5 | Mean : 16489.1 | Mean : 181958 | Mean : 117554 | Mean : 164752.3 | Mean : 89869.9 | Mean : 31522 | Mean : 21800 | Mean : 10285 | Mean : 8182 | Mean : 2471.6 | Mean : 11164.3 | Mean : 431.46 | Mean : 7171.9 | Mean : 1421 | |
| NA | NA | 3rd Qu.:2012 | 3rd Qu.: 11810.2 | 3rd Qu.: 40356 | 3rd Qu.: 5169.8 | 3rd Qu.: 6773.2 | 3rd Qu.: 8368.0 | 3rd Qu.: 3104.8 | 3rd Qu.: 5599.2 | 3rd Qu.: 4348 | 3rd Qu.: 10903 | 3rd Qu.: 9765 | 3rd Qu.: 4492 | 3rd Qu.: 1998 | 3rd Qu.: 10870 | 3rd Qu.: 2101.8 | 3rd Qu.: 2820.2 | 3rd Qu.: 23994 | 3rd Qu.: 21639 | 3rd Qu.: 25049.8 | 3rd Qu.: 17967.8 | 3rd Qu.: 3854 | 3rd Qu.: 3954 | 3rd Qu.: 1224 | 3rd Qu.: 1232 | 3rd Qu.: 230.2 | 3rd Qu.: 1563.2 | 3rd Qu.: 71.25 | 3rd Qu.: 1315.5 | 3rd Qu.: 238 | |
| NA | NA | Max. :2019 | Max. :4506193.0 | Max. :10845595 | Max. :1885356.0 | Max. :1844836.0 | Max. :2441973.0 | Max. :1046015.0 | Max. :2450944.0 | Max. :1304318 | Max. :3033425 | Max. :3430422 | Max. :1664813 | Max. :575139 | Max. :4358214 | Max. :529381.0 | Max. :831502.0 | Max. :7693368 | Max. :6501398 | Max. :6671740.0 | Max. :5019360.0 | Max. :1842275 | Max. :1200349 | Max. :494492 | Max. :437884 | Max. :207555.0 | Max. :833449.0 | Max. :33106.00 | Max. :505470.0 | Max. :73461 |
Summary statistics
The skimr package provides concise summaries of each variable, highlighting aspects such as missing values and data distribution (Waring et al., 2022).
skim(df_death) %>% knitr::kable() %>% kable_styling(font_size = 12, bootstrap_options = "basic", full_width = FALSE) %>% scroll_box(height = "400px", width = "100%")
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | Entity | 0 | 1 | 3 | 34 | 0 | 228 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| character | Code | 0 | 1 | 0 | 8 | 690 | 206 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | Year | 0 | 1 | NA | NA | NA | NA | NA | 2004.5000 | 8.656074e+00 | 1990 | 1997.00 | 2004.5 | 2012.00 | 2019 | ▇▇▇▇▇ |
| numeric | Outdoor.air.pollution | 0 | 1 | NA | NA | NA | NA | NA | 84582.4437 | 3.511973e+05 | 0 | 433.75 | 2101.0 | 11810.25 | 4506193 | ▇▁▁▁▁ |
| numeric | High.systolic.blood.pressure | 0 | 1 | NA | NA | NA | NA | NA | 224224.8618 | 8.634691e+05 | 2 | 1827.75 | 8770.5 | 40355.50 | 10845595 | ▇▁▁▁▁ |
| numeric | Diet.high.in.sodium | 0 | 1 | NA | NA | NA | NA | NA | 40497.1636 | 1.752832e+05 | 0 | 137.00 | 969.5 | 5169.75 | 1885356 | ▇▁▁▁▁ |
| numeric | Diet.low.in.whole.grains | 0 | 1 | NA | NA | NA | NA | NA | 38691.2923 | 1.479084e+05 | 0 | 273.75 | 1444.0 | 6773.25 | 1844836 | ▇▁▁▁▁ |
| numeric | Alochol.use | 0 | 1 | NA | NA | NA | NA | NA | 54848.6028 | 2.112090e+05 | 0 | 263.75 | 1780.5 | 8368.00 | 2441973 | ▇▁▁▁▁ |
| numeric | Diet.low.in.fruits | 0 | 1 | NA | NA | NA | NA | NA | 23957.7633 | 9.451573e+04 | 0 | 144.00 | 834.5 | 3104.75 | 1046015 | ▇▁▁▁▁ |
| numeric | Unsafe.water.source | 0 | 1 | NA | NA | NA | NA | NA | 44086.3830 | 2.020493e+05 | 0 | 7.00 | 182.5 | 5599.25 | 2450944 | ▇▁▁▁▁ |
| numeric | Secondhand.smoke | 0 | 1 | NA | NA | NA | NA | NA | 30364.0077 | 1.222861e+05 | 1 | 209.00 | 994.0 | 4347.75 | 1304318 | ▇▁▁▁▁ |
| numeric | Low.birth.weight | 0 | 1 | NA | NA | NA | NA | NA | 59125.5130 | 2.502265e+05 | 0 | 123.00 | 1057.0 | 10903.25 | 3033425 | ▇▁▁▁▁ |
| numeric | Child.wasting | 0 | 1 | NA | NA | NA | NA | NA | 49924.3652 | 2.226529e+05 | 0 | 26.00 | 504.0 | 9764.75 | 3430422 | ▇▁▁▁▁ |
| numeric | Unsafe.sex | 0 | 1 | NA | NA | NA | NA | NA | 27645.9702 | 1.253130e+05 | 0 | 97.00 | 619.0 | 4492.00 | 1664813 | ▇▁▁▁▁ |
| numeric | Diet.low.in.nuts.and.seeds | 0 | 1 | NA | NA | NA | NA | NA | 12996.3493 | 5.118331e+04 | 0 | 27.00 | 252.0 | 1998.50 | 575139 | ▇▁▁▁▁ |
| numeric | Household.air.pollution.from.solid.fuels | 0 | 1 | NA | NA | NA | NA | NA | 83641.4785 | 3.510360e+05 | 0 | 32.00 | 821.0 | 10870.50 | 4358214 | ▇▁▁▁▁ |
| numeric | Diet.low.in.Vegetables | 0 | 1 | NA | NA | NA | NA | NA | 11982.4572 | 4.532998e+04 | 0 | 109.00 | 590.5 | 2101.75 | 529381 | ▇▁▁▁▁ |
| numeric | Low.physical.activity | 0 | 1 | NA | NA | NA | NA | NA | 16489.0842 | 6.270800e+04 | 0 | 92.00 | 521.5 | 2820.25 | 831502 | ▇▁▁▁▁ |
| numeric | Smoking | 0 | 1 | NA | NA | NA | NA | NA | 181957.5259 | 7.186093e+05 | 1 | 894.00 | 4987.0 | 23993.50 | 7693368 | ▇▁▁▁▁ |
| numeric | High.fasting.plasma.glucose | 0 | 1 | NA | NA | NA | NA | NA | 117554.1887 | 4.521602e+05 | 3 | 1178.00 | 4966.5 | 21639.00 | 6501398 | ▇▁▁▁▁ |
| numeric | Air.pollution | 0 | 1 | NA | NA | NA | NA | NA | 164752.2816 | 6.563585e+05 | 0 | 815.75 | 5748.0 | 25049.75 | 6671740 | ▇▁▁▁▁ |
| numeric | High.body.mass.index | 0 | 1 | NA | NA | NA | NA | NA | 89869.9156 | 3.450420e+05 | 2 | 918.50 | 3917.0 | 17967.75 | 5019360 | ▇▁▁▁▁ |
| numeric | Unsafe.sanitation | 0 | 1 | NA | NA | NA | NA | NA | 31521.5469 | 1.463434e+05 | 0 | 3.00 | 102.0 | 3854.00 | 1842275 | ▇▁▁▁▁ |
| numeric | No.access.to.handwashing.facility | 0 | 1 | NA | NA | NA | NA | NA | 21799.8944 | 9.668259e+04 | 0 | 19.00 | 221.0 | 3953.50 | 1200349 | ▇▁▁▁▁ |
| numeric | Drug.use | 0 | 1 | NA | NA | NA | NA | NA | 10285.2020 | 3.996075e+04 | 0 | 31.00 | 222.0 | 1224.25 | 494492 | ▇▁▁▁▁ |
| numeric | Low.bone.mineral.density | 0 | 1 | NA | NA | NA | NA | NA | 8182.4732 | 3.240392e+04 | 0 | 43.00 | 277.0 | 1232.00 | 437884 | ▇▁▁▁▁ |
| numeric | Vitamin.A.deficiency | 0 | 1 | NA | NA | NA | NA | NA | 2471.5944 | 1.271830e+04 | 0 | 0.00 | 2.0 | 230.25 | 207555 | ▇▁▁▁▁ |
| numeric | Child.stunting | 0 | 1 | NA | NA | NA | NA | NA | 11164.3297 | 5.286625e+04 | 0 | 1.00 | 41.5 | 1563.25 | 833449 | ▇▁▁▁▁ |
| numeric | Discontinued.breastfeeding | 0 | 1 | NA | NA | NA | NA | NA | 431.4567 | 1.901532e+03 | 0 | 0.00 | 4.0 | 71.25 | 33106 | ▇▁▁▁▁ |
| numeric | Non.exclusive.breastfeeding | 0 | 1 | NA | NA | NA | NA | NA | 7171.8531 | 3.167845e+04 | 0 | 3.00 | 60.5 | 1315.50 | 505470 | ▇▁▁▁▁ |
| numeric | Iron.deficiency | 0 | 1 | NA | NA | NA | NA | NA | 1421.3636 | 6.303932e+03 | 0 | 1.00 | 12.0 | 238.00 | 73461 | ▇▁▁▁▁ |
Conclusion
The death data contains 6,840 observations of 31 variables. Two of them are categorical variables:

Entity: represents countries and organisations, with 228 unique entities. Each has 30 rows (spanning 1990 to 2019), ensuring complete coverage for all entities. The variable ‘Entity’ will be renamed to ‘Country’.

Code: represents country codes. There are 690 rows where the Code is missing, and 30 entities have non-standard codes (not exactly three characters), typically indicating organizations or groups of countries. These rows will be removed in the cleaning step.
For the numerical variables:
Year: currently stored as an integer, it will be treated as a categorical variable to represent different time periods.
28 death-related variables: show typical data issues such as invalid values and wide ranges. For example, Alochol.use (a misspelling corrected later to Alcohol.use) ranges from 0 to 2,441,973, suggesting potential inconsistencies. The minimum value of 0 might be valid for some countries; however, it could also represent missing data that was not recorded. Outliers may be associated with countries with large populations or aggregated country groups.
# invalid country codes
nrow(df_death[nchar(df_death$Code) !=0 & nchar(df_death$Code) != 3, ])
## [1] 30
Integrating population and GDP data with death data is important for comprehensive analysis. The population and GDP datasets cover yearly data for each country from 1990 to 2019, which aligns with the periods available in the death data.
Data Overview
GDP indicator: GDP is measured per capita in current US dollars.
# initial check
# head(df_pop) %>% knitr::kable()
# head(df_gdp) %>% knitr::kable()
# str(df_pop) %>% knitr::kable()
# str(df_gdp) %>% knitr::kable()
# summary(df_pop) %>% knitr::kable()
# summary(df_gdp) %>% knitr::kable()
Data Preparation
First, subset the population and GDP data for the years 1990 to 2019 and rename the columns for merging. The tidyr package was used to restructure the data for merging, utilizing the pivot_longer() function to transform wide data into a long format (Wickham, 2020).
# subset and rename columns
df_pop <- subset(df_pop, select = c(Country.Code, X1990:X2019))
colnames(df_pop) <- c('Code', as.character(1990:2019))
df_gdp <- subset(df_gdp, select = c(Country.Code, X1990:X2019))
colnames(df_gdp) <- c('Code', as.character(1990:2019))
# reshape to long format
df_pop_long <- tidyr::pivot_longer(df_pop, cols = -Code, names_to = "Year", values_to = "Population")
df_pop_long$Year <- as.integer(df_pop_long$Year)
df_gdp_long <- tidyr::pivot_longer(df_gdp, cols = -Code, names_to = "Year", values_to = "GDP")
df_gdp_long$Year <- as.integer(df_gdp_long$Year)
Data Merging
Merge population and GDP data first, then combine the result with the death data.
# merge population and GDP
df_pop_gdp <- dplyr::left_join(df_pop_long, df_gdp_long, by = c("Code", "Year"))
# merge death_country and pop.gdp
df_merged <- dplyr::left_join(df_death, df_pop_gdp, by = c("Code", "Year"))
Visualization is a critical step in data analysis: it helps reveal underlying patterns, identify outliers, and guide the subsequent data cleaning and transformation. This section uses different types of plots and charts in an iterative process, visualizing the data both before and after cleaning.
Single Variable Analysis
The first step is to explore variation within individual variables, including their distributions and outliers.

This section starts with the variable "Smoking" as an example to explore the overall distribution, outliers, and range. Smoking is a significant global health concern and a leading cause of death in many countries.
Histogram
# histogram to explore distribution
p1 <- ggplot(df_merged, aes(x = Smoking)) +
geom_histogram(binwidth = 50000, fill = "#1B9E77") +
ggtitle("Histogram of Smoking") +
xlab("Smoking") +
ylab("Count") +
plot_theme()
# log transformation to reduce skewness
x <- log(df_merged$Smoking + 1)
p2 <- ggplot(df_merged, aes(x = x)) +
geom_histogram(binwidth = 1, fill = "#1B9E77", color = "white") +
ggtitle("Histogram of Log Smoking") +
xlab("Log Smoking") +
ylab("Count") +
plot_theme()
grid.arrange(p1,p2, ncol = 2)
The histograms reveal a right-skewed distribution; a log transformation helps reduce this skewness. Several entities, such as small countries and islands, have fewer than 15 deaths.
unique(df_merged[df_merged$Smoking < 15, "Entity"])
## [1] "Nauru" "Niue" "Tokelau"
Meanwhile, countries with large populations, such as China and India, have more than 900,000 deaths due to smoking. This pattern also appears in aggregated groups of countries, such as the European Region.
unique(df_merged[df_merged$Smoking >900000, "Entity"])
## [1] "China" "East Asia & Pacific (WB)"
## [3] "Europe & Central Asia (WB)" "European Region (WHO)"
## [5] "G20" "India"
## [7] "OECD Countries" "Region of the Americas (WHO)"
## [9] "South Asia (WB)" "South-East Asia Region (WHO)"
## [11] "Western Pacific Region (WHO)" "World"
## [13] "World Bank High Income" "World Bank Lower Middle Income"
## [15] "World Bank Upper Middle Income"
Insights for distinguishing countries:
Population size: a key factor for calculating death rates. It allows for more comparable analysis across countries
GDP: economic factors can help identify how the income levels affect health outcomes
Geographical impact: the continent or region may influence health conditions
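As an illustrative sketch of the first insight (not part of the original pipeline), raw death counts can be made comparable across countries by converting them to per-capita rates once population data is merged. The snippet below assumes df_merged from the earlier join; the per-100,000 scale and column name Smoking.rate are assumptions for illustration.

```r
# Sketch: convert raw smoking deaths to a rate per 100,000 people
# (rows without population data are dropped for this illustration)
rate_df <- df_merged %>%
  filter(!is.na(Population)) %>%
  mutate(Smoking.rate = Smoking / Population * 100000)

# inspect countries (3-letter codes only) with the highest rates in 2019
rate_df %>%
  filter(Year == 2019, nchar(Code) == 3) %>%
  arrange(desc(Smoking.rate)) %>%
  select(Entity, Smoking, Population, Smoking.rate) %>%
  head()
```

Ranking by rate rather than raw count removes the population-size effect noted above, so small islands and giants like China become directly comparable.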
Boxplots
# boxplots
p3 <- ggplot(df_merged, aes(x = "", y = Smoking)) +
geom_boxplot(fill = "#1B9E77",outlier.colour = "#D95F02") + ggtitle("Boxplot of Smoking") +
plot_theme()
# log transformation
y4 <- log(df_merged$Smoking)
p4 <- ggplot(df_merged, aes(x = "", y = y4)) +
geom_boxplot(fill = "#1B9E77",outlier.colour = "#D95F02") +
ggtitle("Boxplot of Log Smoking") +
plot_theme()
grid.arrange(p3,p4, ncol = 2)
The boxplots confirm the effect of the log transformation. The original boxplot shows a wide range of outliers because the data is concentrated at low values. The log transformation significantly reduces the effect of outliers and makes the distribution more centered.
Selected dietary factors: "Diet.low.in.fruits" and "Diet.high.in.sodium" have similar distributions after log transformation. Both still show several outliers, even after the transformation.
# boxplots for Diet.low.in.fruits
y5 <- log(df_merged$Diet.low.in.fruits)
p5 <- ggplot(df_merged, aes(x = "", y = y5)) +
geom_boxplot(fill = "#1B9E77",outlier.colour = "#D95F02") +
ggtitle("Boxplot of Log Diet.low.in.fruits") +
plot_theme()
# boxplots for Diet.high.in.sodium
y6 <- log(df_merged$Diet.high.in.sodium)
p6 <- ggplot(df_merged, aes(x = "", y = y6)) +
geom_boxplot(fill = "#1B9E77",outlier.colour = "#D95F02") +
ggtitle("Boxplot of Log Diet.high.in.sodium") +
plot_theme()
grid.arrange(p5,p6, ncol = 2)
Density Plot
# density plot for unsafe water source
p7 <- ggplot(df_merged, aes(x = Unsafe.water.source)) +
geom_histogram(aes(y = ..density..), binwidth = 50000, fill = "#1B9E77", alpha = 0.4, color = "white" ) +
geom_density(color = "#1B9E77", fill = "#1B9E77", alpha = 0.4, size = 1) +
ggtitle("Density Plot of Unsafe.water.source") +
xlab("Unsafe.water.source") +
ylab("Density") +
plot_theme()
# log
x8 <- log(df_merged$Unsafe.water.source)
p8 <- ggplot(df_merged, aes(x = x8)) +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "#1B9E77", alpha = 0.4, color = "white" ) +
geom_density(color = "#1B9E77", fill = "#1B9E77", alpha = 0.3, size = 1) +
ggtitle("Density Plot of Log Unsafe.water.source") +
xlab("Log Unsafe.water.source") +
ylab("Density") +
plot_theme()
grid.arrange(p7,p8, ncol = 2)
The density plot of "Unsafe.water.source" shows an extremely skewed distribution, with the majority of observations concentrated near small values. This skewness indicates that most regions record low levels, while a few record significantly higher levels.
Outlier analysis
After the log transformation, the distribution of "Unsafe.water.source" becomes more normal than before, but still shows 421 outliers, indicating that some regions are disproportionately affected by unsafe water sources.
boxplot.stats(df_merged$Unsafe.water.source)$stats
## [1] 0.0 7.0 182.5 5599.5 13944.0
length(boxplot.stats(df_merged$Unsafe.water.source)$out)
## [1] 1092
# after transformation
boxplot.stats(x8)$stats
## [1] 0.000000 1.945910 5.206746 8.630433 14.711984
length(boxplot.stats(x8)$out)
## [1] 421
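The counts above come from the 1.5 × IQR rule that boxplot.stats() applies: points beyond 1.5 times the interquartile range from the quartile hinges are flagged. A minimal sketch of the same rule written out by hand (boxplot.stats() uses Tukey hinges, which can differ slightly from quantile(), and a +1 offset is added here to avoid log(0)):

```r
# 1.5 * IQR outlier rule, approximating what boxplot.stats() flags
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

# outliers on the log scale (+1 to keep zero counts finite)
length(iqr_outliers(log(df_merged$Unsafe.water.source + 1)))
```

Because the log transformation compresses the upper tail, far fewer points fall outside the whiskers after transformation than before.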
Multiple Variables Analysis
Next is to explore covariation between different variables by scatterplots and correlation analysis.
Scatterplots
chart.Correlation(df_merged[, -c(1:3)], histogram=TRUE)
The scatter matrix above shows the relationships between pairs of cause-related variables. There are strong positive correlations among several variables. Some scatterplots reveal clusters and outliers, which may indicate differences between certain groups of countries.
Correlation matrix
cor_matrix <- cor(df_merged[, -c(1:3)], use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color",type = "full",order = "hclust",tl.col = "black",tl.srt = 90,tl.cex = 0.6, number.cex = 0.5,cl.cex = 1.2,addCoef.col = "black")
High correlations are found among similar cause-related variables. For example, dietary factors such as "Diet.low.in.fruits", "Diet.high.in.sodium", and "Diet.low.in.Vegetables" show strong correlations, suggesting similar influences and allowing them to be grouped together as a common "Diet" category. A similar pattern can be found for "Unsafe.water.source" and "Unsafe.sanitation", indicating a potential grouping related to water and environmental factors. This helps reduce variable redundancy and guides feature selection for the machine learning models.
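As a hedged sketch, the highly correlated pairs can also be listed programmatically rather than read off the plot. The snippet below assumes cor_matrix from the chunk above; the 0.9 threshold is an illustrative choice.

```r
# list variable pairs whose absolute correlation exceeds 0.9
high_cor <- which(abs(cor_matrix) > 0.9 & upper.tri(cor_matrix),
                  arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[high_cor[, 1]],
  var2 = colnames(cor_matrix)[high_cor[, 2]],
  cor  = round(cor_matrix[high_cor], 2)
)
```

Restricting to the upper triangle avoids reporting each pair twice and skips the trivial diagonal (self-correlation of 1).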
Remove invalid country codes
Drop organization entities and invalid country codes, such as regions or non-country entities. The purpose of this project is to focus on country-level data.
df_clean <- df_merged %>%
filter(!is.na(Code) & Code != "") %>%
filter(nchar(Code) == 3)
Rename columns: To improve consistency, "Entity" is renamed to "Country". A typo in the column name "Alochol.use" is corrected.
df_clean <- df_clean %>%
rename(Country = Entity) %>%
rename(Alcohol.use = Alochol.use)
Continent: The countrycode package is used to add a "Continent" column indicating the continent each country belongs to. This addition helps in examining geographic effects on death rates (Arel-Bundock et al., 2023). A missing continent value is manually corrected.
df_clean$Continent = countrycode(as.character(df_clean$Country), "country.name","continent")
# missing value
df_clean <- df_clean %>%
mutate(
Continent = ifelse(Code == 'FSM', 'Oceania', Continent)
) %>% as.data.frame()
Year: Converted to a categorical variable.
df_clean$Year <- factor(df_clean$Year)
# str(df_clean)
Handling missing data
Population: to ensure population information is complete, any rows with missing population data are removed.

GDP: if a country has missing GDP values for certain years, these are imputed with the median GDP for that country; if GDP data is missing for all years, that country is removed.
# drop rows with missing population data
df_clean <- df_clean %>%
filter(!is.na(Population))
# impute missing GDP value
df_clean <- df_clean %>%
group_by(Code) %>%
mutate(GDP = ifelse(is.na(GDP), median(GDP, na.rm = TRUE), GDP)) %>%
ungroup()
# drop rows with missing GDP for all years
df_clean <- df_clean %>%
filter(!is.na(GDP))
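A quick sanity check (illustrative, not part of the original pipeline) can confirm that the imputation and filtering left no missing values and report how many countries survived the cleaning:

```r
# verify that no missing Population or GDP values remain
stopifnot(!anyNA(df_clean$Population), !anyNA(df_clean$GDP))

# number of distinct countries retained after cleaning
length(unique(df_clean$Code))
```

If either assertion fails, the imputation step above has left gaps (e.g., a country whose GDP is missing for every year survived the filter), which would need to be investigated before modelling.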
This feature selection process is guided by the Global Burden of Disease (GBD) study by the Institute for Health Metrics and Evaluation (IHME). The GBD study provides a comprehensive evaluation of global health trends and includes factors that contribute to deaths across a wide range of causes (IHME, Global Burden of Disease, 2024).

The death dataset contains 28 risk factors, categorized in detail according to GBD's Level 3 and Level 4 classifications. To simplify these features, they were mapped into IHME's three-level framework.
col = brewer.pal(n = 3, name = "Dark2")
risk_factors <- data.frame(
Level1 = c(rep("Behavioral", 11), rep("Environmental", 2), rep("Metabolic", 4)),
Level2 = c("", "", "", "Iron.deficiency","Vitamin.A.deficiency", "Drug.use",
"Alcohol.use","Unsafe.sex","Low.physical.activity","","", "","Air.pollution",
"High.systolic.blood.pressure", "High.fasting.plasma.glucose",
"High.body.mass.index", "Low.bone.mineral.density"),
Level3 = c("Non.exclusive.breastfeeding, Discontinued.breastfeeding",
"Child.wasting, Child.stunting", "Low.birth.weight","", "", "", "", "", "", "Smoking, Secondhand.smoke", "Diet.high.in.sodium, Diet.low.in.whole.grains,
Diet.low.in.fruits, Diet.low.in.nuts.and.seeds, Diet.low.in.vegetables",
"No.access.to.handwashing.facility, Unsafe.water.source, Unsafe.sanitation",
"Household.air.pollution.from.solid.fuels, Outdoor.air.pollution","","","","")
)
kable(risk_factors, "html", escape = FALSE) %>% kable_styling(font_size = 12, bootstrap_options = "basic", full_width = FALSE) %>% scroll_box(height = "50%", width = "100%") %>%
row_spec(0, bold = T, color = "black") %>%
row_spec(which(risk_factors$Level1 == "Behavioral"), bold = T, color = col) %>%
row_spec(which(risk_factors$Level1 == "Environmental"), bold = T, color = col[3]) %>%
row_spec(which(risk_factors$Level1 == "Metabolic"), bold = T, color =col[2])
| Level1 | Level2 | Level3 |
|---|---|---|
| Behavioral | Non.exclusive.breastfeeding, Discontinued.breastfeeding | |
| Behavioral | Child.wasting, Child.stunting | |
| Behavioral | Low.birth.weight | |
| Behavioral | Iron.deficiency | |
| Behavioral | Vitamin.A.deficiency | |
| Behavioral | Drug.use | |
| Behavioral | Alcohol.use | |
| Behavioral | Unsafe.sex | |
| Behavioral | Low.physical.activity | |
| Behavioral | Smoking, Secondhand.smoke | |
| Behavioral | Diet.high.in.sodium, Diet.low.in.whole.grains, Diet.low.in.fruits, Diet.low.in.nuts.and.seeds, Diet.low.in.vegetables | |
| Environmental | No.access.to.handwashing.facility, Unsafe.water.source, Unsafe.sanitation | |
| Environmental | Air.pollution | Household.air.pollution.from.solid.fuels, Outdoor.air.pollution |
| Metabolic | High.systolic.blood.pressure | |
| Metabolic | High.fasting.plasma.glucose | |
| Metabolic | High.body.mass.index | |
| Metabolic | Low.bone.mineral.density |
From the correlation analysis, it was evident that many of the original variables have high correlations with each other. This supports the feature selection and grouping process.
High correlations among similar cause-related variables

"Diet.low.in.fruits", "Diet.high.in.sodium", "Diet.low.in.Vegetables", "Diet.low.in.nuts.and.seeds", and "Diet.low.in.whole.grains" have strong correlations, with coefficients from 0.81 to 0.92. These can be grouped under the "Dietary" category.

chart.Correlation(df_clean[, c("Diet.high.in.sodium","Diet.low.in.whole.grains", "Diet.low.in.fruits","Diet.low.in.nuts.and.seeds", "Diet.low.in.Vegetables")], histogram=TRUE)
Unsafe.water

"No.access.to.handwashing.facility", "Unsafe.water.source" and "Unsafe.sanitation" have near-perfect correlations, with coefficients from 0.99 to 1. These can be grouped under the "Unsafe.water" category.

chart.Correlation(df_clean[, c("No.access.to.handwashing.facility","Unsafe.water.source", "Unsafe.sanitation")], histogram=TRUE)
Breastfeeding: "Non.exclusive.breastfeeding" and "Discontinued.breastfeeding" have a high correlation of 0.95. These can be grouped under the "Breastfeeding" category.
chart.Correlation(df_clean[, c("Non.exclusive.breastfeeding", "Discontinued.breastfeeding")], histogram=TRUE)
Tobacco: "Smoking" and "Secondhand.smoke" have a high correlation of 0.98. These can be grouped under the "Tobacco" category.
chart.Correlation(df_clean[, c("Smoking", "Secondhand.smoke")], histogram=TRUE)
Child.growth: "Child.wasting" and "Child.stunting" have a high correlation of 0.98. These can be grouped under the "Child.growth" category.
chart.Correlation(df_clean[, c("Child.wasting", "Child.stunting")], histogram=TRUE)
These high correlations indicate potential redundancy. This process reduces redundancy and prepares a more focused dataset for the later machine learning models.
Levels Explanation:
Level 1: includes the broad categories Behavioral, Environmental, and Metabolic. It aligns with IHME's highest classification level and helps to identify the major types of influences.

Level 2: contains 17 risk factors. Level 3 variables that have similar influences or are highly correlated are merged into Level 2 groups. This step plays a significant role in reducing the number of features while maintaining meaningful information.

Level 3: contains 27 factors ("Air.pollution" is removed due to its redundancy).
Remove "Air.pollution"

It is redundant, as it represents the combined total of Household.air.pollution.from.solid.fuels and Outdoor.air.pollution.
causesVar <- colnames(df_clean)[4:31]
# Level 3
df_new <- df_clean[,!names(df_clean) %in% "Air.pollution" ]
# combined into new variables
df_new_1 <- df_new %>%
mutate(
Breastfeeding = (Non.exclusive.breastfeeding + Discontinued.breastfeeding),
Child.growth = (Child.wasting + Child.stunting),
Tobacco = (Smoking + Secondhand.smoke),
Dietary = (Diet.high.in.sodium + Diet.low.in.whole.grains + Diet.low.in.fruits +
Diet.low.in.nuts.and.seeds + Diet.low.in.Vegetables),
Unsafe.water = (No.access.to.handwashing.facility + Unsafe.water.source +
Unsafe.sanitation),
Air.pollution = (Household.air.pollution.from.solid.fuels + Outdoor.air.pollution))
# after aggregation, individual variables were removed
df_new_1 <- subset(df_new_1, select = -c(Non.exclusive.breastfeeding,Discontinued.breastfeeding, Child.wasting, Child.stunting, Smoking, Secondhand.smoke, Diet.high.in.sodium, Diet.low.in.whole.grains, Diet.low.in.fruits, Diet.low.in.nuts.and.seeds, Diet.low.in.Vegetables,No.access.to.handwashing.facility, Unsafe.water.source, Unsafe.sanitation, Household.air.pollution.from.solid.fuels, Outdoor.air.pollution ))
# convert to long format
df_long <- pivot_longer(df_clean,cols = causesVar,names_to = "Level3",values_to = "Deaths")
causesVar <- causesVar[causesVar != "Air.pollution"] # remove Air.pollution
df_long$Level2 <- factor(df_long$Level3)
levels(df_long$Level2) <- list(
"Breastfeeding" = c("Non.exclusive.breastfeeding", "Discontinued.breastfeeding"),
"Child.growth" = c("Child.wasting", "Child.stunting"),
"Low.birth.weight" = "Low.birth.weight",
"Iron.deficiency" = "Iron.deficiency",
"Vitamin.A.deficiency" = "Vitamin.A.deficiency",
"Drug.use" = "Drug.use",
"Alcohol.use" = "Alcohol.use",
"Unsafe.sex" = "Unsafe.sex",
"Low.physical.activity" = "Low.physical.activity",
"Tobacco" = c("Smoking", "Secondhand.smoke"),
"Dietary" = c("Diet.high.in.sodium","Diet.low.in.whole.grains", "Diet.low.in.fruits","Diet.low.in.nuts.and.seeds", "Diet.low.in.Vegetables"),
"Unsafe.water" = c("No.access.to.handwashing.facility","Unsafe.water.source", "Unsafe.sanitation"),
"Air.pollution" = c("Household.air.pollution.from.solid.fuels", "Outdoor.air.pollution"),
# Air.pollution: NA (ignore)
"High.systolic.blood.pressure" = "High.systolic.blood.pressure",
"High.fasting.plasma.glucose" = "High.fasting.plasma.glucose",
"High.body.mass.index" = "High.body.mass.index",
"Low.bone.mineral.density" = "Low.bone.mineral.density"
)
df_long$Level1 <- factor(df_long$Level2)
levels(df_long$Level1) <- list(
"Behavioral" = c("Breastfeeding", "Child.growth","Low.birth.weight","Iron.deficiency",
"Vitamin.A.deficiency", "Drug.use", "Alcohol.use", "Unsafe.sex", "Low.physical.activity",
"Tobacco", "Dietary"),
"Environmental" = c("Unsafe.water", "Air.pollution"),
"Metabolic" = c("High.systolic.blood.pressure", "High.fasting.plasma.glucose", "High.body.mass.index", "Low.bone.mineral.density")
)
df_long <- df_long[df_long$Level3 != "Air.pollution", ]
Death Rate: calculated for each country and year as total deaths (summed over the causes) divided by the population, expressed per 100,000 people.
factors <- colnames(df_new_1)[!colnames(df_new_1) %in% c("Country", "Code", "Year","GDP", "Population","Continent")]
df_new_1$Total_Deaths <- rowSums(df_new_1[, factors])
df_new_1 <- df_new_1 %>%
mutate(Death_Rate = (Total_Deaths / Population) * 100000)
Death Rate Normalized: calculated as the death rate divided by GDP. It is used to adjust the death rate by each country’s economic status.
df_new_1 <- df_new_1 %>%
mutate(Death_Rate_Normalized = Death_Rate / GDP)
Log transformation
Because the visual analysis (histograms and boxplots) showed high skewness in the death counts, a log transformation was applied to reduce the skewness and improve data normality.
log_trans <- function(df, vars) {
for (var in vars) {
df[,var] <- log(df[,var] + 1)
}
return(df)
}
df_new_log <- log_trans(df_new_1, factors)
Scaling
After the log transformation, the feature variables were standardized (scaled) to ensure compatibility with the machine learning models.
df_new_2_scaled <- as.data.frame(scale(df_new_log[, factors]))
df_new_2 <- cbind(df_new_log[, setdiff(names(df_new_log), factors)], df_new_2_scaled )
processed_dir <- "../data/processed"
# ensure directory exists
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)
write.csv(
df_long,
file.path(processed_dir, "death_long.csv"),
row.names = FALSE
)
write.csv(
df_new_1,
file.path(processed_dir, "death_original.csv"),
row.names = FALSE
)
write.csv(
df_new_log,
file.path(processed_dir, "death_log.csv"),
row.names = FALSE
)
This section selects several key plots from the Shiny app to highlight important findings. Please refer to the Shiny app for more comprehensive visualizations.
The datasets used in this section were prepared for the accompanying Shiny application and are reused here for static visualization.
format_numbers <- function(num) {
sapply(num, function(x) {
if (x >= 1e9) {
paste(format(round(x / 1e9, 2), nsmall = 2), "B")
} else if (x >= 1e6) {
paste(format(round(x / 1e6, 2), nsmall = 2), "M")
} else if (x >= 1e3) {
paste(format(round(x / 1e3, 2), nsmall = 2), "K")
} else {
as.character(x)
}
})
}
df_shiny <- read.csv(file.path(processed_dir, "dShiny.csv"))
df_shiny$Year <- factor(df_shiny$Year)
df_shiny_long <- read.csv(file.path(processed_dir, "dShiny_long.csv"))
df_shiny_long$Year <- factor(df_shiny_long$Year)
# bar Plot for Total Deaths by Country
cols <- c("Africa" = "#1B9E77", "Europe" = "#7570B3", "Asia" = "#D95F02",
"Americas" = "#E7298A", "Oceania" = "#66A61E")
df_plot_1 <- aggregate(df_shiny[, "Total_Deaths"], by = list(df_shiny[, "Country"]), FUN = sum)
names(df_plot_1) <- c("Country", "Total_Deaths")
df_plot_1 <- df_plot_1 %>%
left_join(df_shiny %>% distinct(Country, Continent), by = c("Country" = "Country"))
df_plot_1 <- df_plot_1[order(-df_plot_1$Total_Deaths), ]
df_plot_1 <- head(df_plot_1, 20)
p9 <- ggplot(df_plot_1, aes(x = reorder(Country, Total_Deaths), y = Total_Deaths, fill = Continent)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = cols) +
labs(title = "Total Deaths by Country", x = "", y = "Total Deaths", fill = "Continent") +
geom_text(aes(label = format_numbers(Total_Deaths)), hjust = -0.1, size = 3) +
scale_y_continuous(limits = c(0, max(df_plot_1$Total_Deaths) * 1.2)) +
plot_theme() +
theme(legend.position = "bottom")
df_plot_2 <- aggregate(df_shiny[, "Death_Rate"], by = list(df_shiny[, "Country"]), FUN = sum)
names(df_plot_2) <- c("Country", "Death_Rate")
df_plot_2 <- df_plot_2 %>%
left_join(df_shiny %>% distinct(Country, Continent), by = c("Country" = "Country"))
df_plot_2 <- df_plot_2[order(-df_plot_2$Death_Rate), ]
df_plot_2 <- head(df_plot_2, 20)
p10 <- ggplot(df_plot_2, aes(x = reorder(Country, Death_Rate), y = Death_Rate, fill = Continent)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = cols) +
labs(title = "Death Rate by Country", x = "", y = "Death_Rate", fill = "Continent") +
geom_text(aes(label = format_numbers(Death_Rate)), hjust = -0.1, size = 3) +
scale_y_continuous(limits = c(0, max(df_plot_2$Death_Rate) * 1.2)) +
plot_theme() +
theme(legend.position = "bottom")
grid.arrange(p9, p10, ncol=2)
Bar Plot of Total Deaths by Country: shows the top 20 countries with the highest total deaths. China and India have the most deaths, primarily due to their large populations. Asian countries are highly represented in the top ranks.
Bar Plot of Death Rate by Country: shows the death rate for the top 20 countries. Unlike total deaths, the highest death rates are observed in European countries. This indicates that high death rates are not only due to population size but also other contributing factors.
df_plot_3 <- df_shiny
p11 <- ggplot(df_plot_3, aes(x = Continent, y = Total_Deaths, fill = Continent)) +
geom_boxplot() +
coord_flip() +
labs(title = "Total_Deaths by Continent", x = "", y = "Total_Deaths") +
plot_theme() +
theme(legend.position = "bottom")
df_plot_4 <- df_shiny
df_plot_4$Total_Deaths <- log(df_plot_4$Total_Deaths)
p12 <- ggplot(df_plot_4, aes(x = Continent, y = Total_Deaths, fill = Continent)) +
geom_boxplot() +
coord_flip() +
labs(title = "Total_Deaths by Continent (Log Transformed)", x = "", y = "log(Total_Deaths)") +
plot_theme() +
theme(legend.position = "bottom")
grid.arrange(p11, p12, ncol=2)
Box Plot of Total Deaths by Continent: shows the distribution of total deaths by continent. Asia shows a wide range with many outliers, which reflects countries like China and India having extremely high numbers of deaths compared to other countries.
Box Plot of Total Deaths by Continent (Log Transformed): after the log transformation, the outliers become less pronounced and the data is more centralized.
df_plot_5 <- df_shiny_long
cols <- c("Behavioral" = "#1B9E77", "Environmental" = "#7570B3", "Metabolic" = "#D95F02")
df_plot_5 <- aggregate(Deaths ~ Cause + Category, df_plot_5, sum)
p13 <- ggplot(df_plot_5, aes(x = reorder(Cause, Deaths), y = Deaths, fill = Category)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = cols) +
geom_text(aes(label = format_numbers(Deaths)), hjust = -0.1, size = 3) +
labs(title = "Total Deaths by Cause", fill = "Category", x = "", y = "Total deaths") +
scale_y_continuous(limits = c(0, max(df_plot_5$Deaths) * 1.2)) +
plot_theme() +
theme(legend.position = "bottom")
df_plot_6 <- df_shiny_long
df_plot_6 <- aggregate(Deaths ~ Category, df_plot_6, sum)
p14 <- ggplot(df_plot_6, aes(x = reorder(Category, Deaths), y = Deaths, fill = Category)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = cols) +
geom_text(aes(label = format_numbers(Deaths)), hjust = -0.1, size = 3) +
labs(title = "Total Deaths by Category", x = "", y = "Total deaths") +
scale_y_continuous(limits = c(0, max(df_plot_6$Deaths) * 1.2)) +
plot_theme() +
theme(legend.position = "bottom")
grid.arrange(p13, p14, ncol=2)
Bar Plot of Total Deaths by Cause: shows the total number of deaths attributed to 17 specific causes. High Systolic Blood Pressure is the leading cause, accounting for 255.41 million deaths. Tobacco use and Air Pollution also have a significant impact. Other major causes include Dietary factors, High Fasting Plasma Glucose, and Unsafe Water.
Bar Plot of Total Deaths by Category: summarizes deaths by the three major categories: Behavioral, Environmental, and Metabolic risks. Behavioral risks account for the highest number of deaths, approximately 677.32 million, suggesting that changing health behaviors (such as reducing tobacco use or improving diet quality) could significantly lower global death rates. Metabolic risks are the next major contributor, with around 499.52 million deaths. Environmental risks, while comparatively lower, still account for 321.27 million deaths, underscoring the importance of addressing environmental health concerns.
df_plot_7 <- df_shiny
cols <- c("Africa" = "#1B9E77", "Europe" = "#7570B3", "Asia" = "#D95F02",
"Americas" = "#E7298A", "Oceania" = "#66A61E")
df_plot_7$Behavioral <- log(df_plot_7$Behavioral)
df_plot_7$Metabolic <- log(df_plot_7$Metabolic)
ggplot(df_plot_7, aes(x = Behavioral, y = Metabolic, color = Continent)) +
geom_point(alpha = 0.25) +
labs(title = "Behavioral and Metabolic Risks (Log Transformed)", x = "Behavioral", y = "Metabolic") +
plot_theme() +
scale_color_manual(values = cols) +
theme(legend.position = "bottom")
Scatterplot: shows the relationship between behavioral and metabolic risk factors across continents (log transformed). There is a strong positive correlation between behavioral and metabolic risks, indicating that countries with higher behavioral risk factors (e.g., tobacco use, poor diet) also tend to have higher metabolic risk factors (e.g., high blood pressure, high BMI). The colors indicate that the trend holds globally, with slight variations among regions. This positive correlation suggests that reducing behavioral risk factors could help reduce metabolic risks as well.
This section focuses on building machine learning classification models to predict the death rate category (high/low) for each country. A Decision Tree is used to understand feature importance. KNN is used to identify the optimal k value and compare its performance with the Decision Tree. LIME is used to interpret feature contributions and demonstrate the model’s transparency.
Three feature sets are selected to compare the performance of the two classifiers:
Feature set 1: 17 log-scaled variables (refer to Section 2.4.2 Data Transformation)
Feature set 2: 12 selected variables obtained from the single-variable models (refer to Section 3.2 Single Variable Modeling)
Feature set 3: 9 important variables obtained from PCA (refer to Section 3.3 PCA)
Disclaimer: The implementation in Chapter 3 (Classification) and Chapter 4 (Clustering) is based on lecture materials and lab exercises. ChatGPT-4 was used as a learning aid to clarify concepts, understand R library usage, and support code interpretation (e.g., PCA implementation and KNN hyperparameter tuning with train()).
Death Rate Category
The target variable is created from the normalized death rate; the binary classification indicates whether the value is above or below the median.
df_new_2 <- mutate(df_new_2, Death_Rate_Cat = ifelse(Death_Rate_Normalized > median(Death_Rate_Normalized),"1","0"))
d <- subset(df_new_2, select = -c(GDP, Population, Total_Deaths, Death_Rate, Death_Rate_Normalized))
outcome <- 'Death_Rate_Cat'
pos <- '1'
death_cat_avg <- aggregate(Death_Rate_Cat ~ Country, df_new_2, FUN = function(x) sum(x == '1') / 30)
The dataset is split into training, calibration, and testing subsets.
# split data
set.seed(4009)
d$rgroup <- runif(nrow(d))
dTrainAll <- subset(d, rgroup<=0.9)
dTest <- subset(d, rgroup > 0.9)
useForCal <- rbinom(n=dim(dTrainAll)[[1]], size=1, prob=0.1) > 0
dCal <- subset(dTrainAll, useForCal)
dTrain <- subset(dTrainAll,!useForCal)
cat('Training, calibration and testing set sizes are:', nrow(dTrain), ',', nrow(dCal), 'and', nrow(dTest))
## Training, calibration and testing set sizes are: 4840 , 508 and 622
model_data_dir <- "../data/model_inputs"
if (!dir.exists(model_data_dir)) {
dir.create(model_data_dir, recursive = TRUE)
}
write.csv(dTrain, file.path(model_data_dir, "dTrain.csv"), row.names = FALSE)
write.csv(dCal, file.path(model_data_dir, "dCal.csv"), row.names = FALSE)
write.csv(dTest, file.path(model_data_dir, "dTest.csv"), row.names = FALSE)
write.csv(df_new_2, file.path(model_data_dir, "dTransformed.csv"), row.names = FALSE)
The features are divided into categorical and numerical variables. The 17 log-scaled numeric variables are used as feature set 1.
vars <- setdiff(colnames(dTrainAll), c(outcome,'rgroup','Country'))
catVars <- vars[sapply(dTrainAll[,vars], class) %in% c('factor','character')]
catVars
## [1] "Code" "Year" "Continent"
numericVars <- vars[sapply(dTrainAll[,vars], class) %in% c('numeric','integer')]
numericVars
## [1] "High.systolic.blood.pressure" "Alcohol.use"
## [3] "Low.birth.weight" "Unsafe.sex"
## [5] "Low.physical.activity" "High.fasting.plasma.glucose"
## [7] "High.body.mass.index" "Drug.use"
## [9] "Low.bone.mineral.density" "Vitamin.A.deficiency"
## [11] "Iron.deficiency" "Breastfeeding"
## [13] "Child.growth" "Tobacco"
## [15] "Dietary" "Unsafe.water"
## [17] "Air.pollution"
This section evaluates the predictive power of individual variables before combining them into multivariate models, with the goal of identifying strong standalone predictors of high vs. low death rates.
The null model is used as a baseline to evaluate the performance of other models. It predicts the death rate (high/low) based on the proportion of positive cases in the training data. The log likelihood and AUC are calculated for this null model for further comparisons.
logLikelihood <- function(ypred, ytrue,epsilon=1e-6) {
sum(ifelse(ytrue, log(ypred+epsilon), log(1-ypred+epsilon)), na.rm=T)
}
(Npos <- sum(dTrain[,outcome] == 1))
## [1] 2433
pred.Null <- Npos / nrow(dTrain)
cat("Proportion of outcome == 1 in dTrain:", pred.Null)
## Proportion of outcome == 1 in dTrain: 0.502686
calcAUC <- function(ypred, ytrue) {
perf <- performance(prediction(ypred, ytrue), 'auc')
as.numeric(perf@y.values)
}
TP <- 0
TN <- sum(dCal[, outcome] == "0")
FP <- 0
FN <- sum(dCal[, outcome] == "1")
cat("nrow(dCal):", nrow(dCal), "TP:", TP, "TN:", TN, "FP:", FP, "FN:", FN)
## nrow(dCal): 508 TP: 0 TN: 270 FP: 0 FN: 238
(accuracy <- (TP + TN) / nrow(dCal))
## [1] 0.5314961
(precision <- TP/(TP + FP))
## [1] NaN
pred.Null <- rep(pred.Null, nrow(dCal))
(AUC <- calcAUC(pred.Null, dCal[,outcome]))
## [1] 0.5
logNull <- logLikelihood(sum(dCal[,outcome]==pos)/nrow(dCal), dCal[,outcome]==pos)
cat("The log likelihood of the Null model is:", logNull)
## The log likelihood of the Null model is: -351.1092
To examine the predictive ability of each categorical variable, single-variable models are created.
Prediction: for each categorical variable, mkPredC() predicts a high death rate based on the levels of that variable.
Evaluation: each model is examined using the Area Under the Curve (AUC) on both the training and calibration sets. Variables that show a significant reduction in deviance compared to the null model are selected.
plot_roc <- function(predcol, outcol, colour_id, label, overlaid=F) {
ROCit_obj <- rocit(score = predcol, class = outcol == "1")
par(new = overlaid)
plot(ROCit_obj, col = c(colour_id,"black"), main = "", xlab = "", ylab = "",
legend = FALSE, YIndex = FALSE, values = FALSE)
lines(ROCit_obj$fpr, ROCit_obj$tpr, col = colour_id, lwd = 2, type = "b")
}
# perform single variable prediction for a given categorical column
mkPredC <- function(outCol, varCol, appCol) {
pPos <- sum(outCol==pos)/length(outCol)
naTab <- table(as.factor(outCol[is.na(varCol)]))
pPosWna <- (naTab/sum(naTab))[pos]
vTab <- table(as.factor(outCol), varCol)
pPosWv <- (vTab[pos,]+1.0e-3*pPos)/(colSums(vTab)+1.0e-3)
pred <- pPosWv[appCol]
pred[is.na(appCol)] <- pPosWna
pred[is.na(pred)] <- pPos
pred
}
for (v in catVars) {
pi <- paste('pred', v, sep='')
dTrain[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTrain[,v])
dCal[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dCal[,v])
dTest[,pi] <- mkPredC(dTrain[,outcome], dTrain[,v], dTest[,v])
}
for(v in catVars) {
pi <- paste('pred', v, sep='')
aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
if (aucTrain >= 0.1) {
aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
print(sprintf(
"%s: trainAUC: %4.3f; calibrationAUC: %4.3f",
pi, aucTrain, aucCal))
}
}
## [1] "predCode: trainAUC: 0.973; calibrationAUC: 0.965"
## [1] "predYear: trainAUC: 0.620; calibrationAUC: 0.612"
## [1] "predContinent: trainAUC: 0.776; calibrationAUC: 0.776"
selCatVars <- c()
minDrop <- 30
for (v in catVars) {
pi <- paste('pred', v, sep='')
devDrop <- 2*(logLikelihood(dCal[,pi], dCal[,outcome]==pos) - logNull)
if (devDrop >= minDrop) {
cat(sprintf("%6s, deviance reduction: %g\n", v, devDrop))
selCatVars <- c(selCatVars, pi)
}
}
## Code, deviance reduction: 474.918
## Continent, deviance reduction: 139.232
AUC:
predCode: train AUC = 0.973, calibration AUC = 0.965
predContinent: train AUC = 0.776, calibration AUC = 0.776
Deviance reduction:
Code: 474.918
Continent: 139.232
The next step is to evaluate each numerical variable.
Prediction: mkPredN() converts each numerical variable into categorical bins (deciles), and the same method mkPredC() is then used to predict the outcomes.
Evaluation: same as for the categorical variables (AUC and deviance reduction).
mkPredN <- function(outCol, varCol, appCol) {
cuts <- unique(as.numeric(
quantile(varCol, probs=seq(0, 1, 0.1), na.rm=T)))
varC <- cut(varCol, cuts)
appC <- cut(appCol, cuts)
mkPredC(outCol, varC, appC)
}
for (v in numericVars) {
pi <- paste('pred', v, sep='')
dTrain[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTrain[,v])
dTest[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dTest[,v])
dCal[,pi] <- mkPredN(dTrain[,outcome], dTrain[,v], dCal[,v])
}
for(v in numericVars) {
pi <- paste('pred', v, sep='')
aucTrain <- calcAUC(dTrain[,pi], dTrain[,outcome])
if (aucTrain >= 0.1) {
aucCal <- calcAUC(dCal[,pi], dCal[,outcome])
print(sprintf(
"%s: trainAUC: %4.3f; calibrationAUC: %4.3f",
pi, aucTrain, aucCal))
}
}
## [1] "predHigh.systolic.blood.pressure: trainAUC: 0.635; calibrationAUC: 0.626"
## [1] "predAlcohol.use: trainAUC: 0.669; calibrationAUC: 0.679"
## [1] "predLow.birth.weight: trainAUC: 0.768; calibrationAUC: 0.795"
## [1] "predUnsafe.sex: trainAUC: 0.698; calibrationAUC: 0.691"
## [1] "predLow.physical.activity: trainAUC: 0.642; calibrationAUC: 0.633"
## [1] "predHigh.fasting.plasma.glucose: trainAUC: 0.626; calibrationAUC: 0.602"
## [1] "predHigh.body.mass.index: trainAUC: 0.641; calibrationAUC: 0.619"
## [1] "predDrug.use: trainAUC: 0.619; calibrationAUC: 0.620"
## [1] "predLow.bone.mineral.density: trainAUC: 0.670; calibrationAUC: 0.632"
## [1] "predVitamin.A.deficiency: trainAUC: 0.847; calibrationAUC: 0.858"
## [1] "predIron.deficiency: trainAUC: 0.819; calibrationAUC: 0.829"
## [1] "predBreastfeeding: trainAUC: 0.835; calibrationAUC: 0.843"
## [1] "predChild.growth: trainAUC: 0.799; calibrationAUC: 0.813"
## [1] "predTobacco: trainAUC: 0.641; calibrationAUC: 0.625"
## [1] "predDietary: trainAUC: 0.624; calibrationAUC: 0.611"
## [1] "predUnsafe.water: trainAUC: 0.796; calibrationAUC: 0.813"
## [1] "predAir.pollution: trainAUC: 0.710; calibrationAUC: 0.731"
selNumVars <- c()
minDrop <- 25
for (v in numericVars) {
pi <- paste('pred', v, sep='')
devDrop <- 2*(logLikelihood(dCal[,pi], dCal[,outcome]==pos) - logNull)
if (devDrop >= minDrop) {
cat(sprintf("%6s, deviance reduction: %g\n", v, devDrop))
selNumVars <- c(selNumVars, pi)
}
}
## Alcohol.use, deviance reduction: 50.0236
## Low.birth.weight, deviance reduction: 146.727
## Unsafe.sex, deviance reduction: 61.3217
## Low.physical.activity, deviance reduction: 26.8478
## Vitamin.A.deficiency, deviance reduction: 260.36
## Iron.deficiency, deviance reduction: 188.907
## Breastfeeding, deviance reduction: 205.372
## Child.growth, deviance reduction: 174.091
## Tobacco, deviance reduction: 28.3631
## Dietary, deviance reduction: 26.0176
## Unsafe.water, deviance reduction: 176.854
## Air.pollution, deviance reduction: 86.9477
(selVars <- c(selCatVars, selNumVars))
## [1] "predCode" "predContinent"
## [3] "predAlcohol.use" "predLow.birth.weight"
## [5] "predUnsafe.sex" "predLow.physical.activity"
## [7] "predVitamin.A.deficiency" "predIron.deficiency"
## [9] "predBreastfeeding" "predChild.growth"
## [11] "predTobacco" "predDietary"
## [13] "predUnsafe.water" "predAir.pollution"
col = c(brewer.pal(n = 8, name = "Dark2"),brewer.pal(n = 11, name = "RdBu"))
n = 1
labels = vector("character", length(selVars))
cols = vector("numeric", length(selVars))
for (v in selVars) {
plot_roc(dTest[, v], dTest[, outcome], colour_id = col[n], label = v, overlaid = T)
labels[n] <- v
cols[n] <- col[n]
n = n + 1
}
legend("bottomright", legend = labels, col = cols, lwd = 2, cex = 0.8, merge = TRUE)
title("ROC for Select Variables on the Test Set")
cat("Performance of the top performing single variables on the test set:")
## Performance of the top performing single variables on the test set:
for (v in selVars) {
# retrieve the original variable name (character location 5 onward)
orig_v <- substring(v, 5)
cat(sprintf("Variable %6s: AUC = %g\n", orig_v, calcAUC(dTest[,v], dTest[,outcome]==pos)))
}
## Variable Code: AUC = 0.969781
## Variable Continent: AUC = 0.772557
## Variable Alcohol.use: AUC = 0.652572
## Variable Low.birth.weight: AUC = 0.797192
## Variable Unsafe.sex: AUC = 0.69343
## Variable Low.physical.activity: AUC = 0.62712
## Variable Vitamin.A.deficiency: AUC = 0.862685
## Variable Iron.deficiency: AUC = 0.840485
## Variable Breastfeeding: AUC = 0.860752
## Variable Child.growth: AUC = 0.822845
## Variable Tobacco: AUC = 0.61891
## Variable Dietary: AUC = 0.603808
## Variable Unsafe.water: AUC = 0.81554
## Variable Air.pollution: AUC = 0.746417
The ROC curves for the selected variables visualize their predictive power on the test set. Among the numerical variables, the best performers by AUC are:
Vitamin A deficiency: AUC = 0.8627
Breastfeeding: AUC = 0.8608
Iron deficiency: AUC = 0.8405
Child growth: AUC = 0.8228
Unsafe water: AUC = 0.8155
These variables show higher AUC values than the others; Vitamin A deficiency has the highest AUC, making it the best individual numerical predictor.
Overall, both categorical (Code, Continent) and selected numerical risk factors demonstrate strong standalone predictive power. These variables are carried forward into Feature Set 2 for multivariate modeling.
PCA is applied to assist feature selection by identifying dominant patterns among highly correlated numerical variables and reducing redundancy prior to multivariate modeling. By using PCA, the aim is to reduce the dimensionality of the dataset and maintain the maximum amount of variance.
prcomp() is used to perform PCA on the scaled numerical variables; it transforms the original features into new orthogonal components. Note: the PCA implementation was adapted with assistance from ChatGPT-4 to clarify the feature selection logic and the loading-based interpretation.
pca_result <- prcomp(d[, factors], scale. = TRUE)
loadings_matrix <- pca_result$rotation
important_features <- c()
cumulative_threshold <- 0.9
for (i in 1:ncol(loadings_matrix)) {
cumulative_variance <- sum((pca_result$sdev[1:i])^2) / sum((pca_result$sdev)^2)
if (cumulative_variance > cumulative_threshold) break
pc <- paste0("PC", i)
important_features <- unique(c(important_features, rownames(loadings_matrix)[abs(loadings_matrix[, pc]) > 0.25]))
}
important_features
## [1] "High.systolic.blood.pressure" "Alcohol.use"
## [3] "Low.birth.weight" "High.fasting.plasma.glucose"
## [5] "Drug.use" "Low.bone.mineral.density"
## [7] "Tobacco" "Dietary"
## [9] "Air.pollution"
vars1 <- numericVars
vars2 <- selNumVars
vars3 <- important_features
cat('Number of features used in tmodel1 is:', length(vars1))
## Number of features used in tmodel1 is: 17
cat('Number of features used in tmodel2 is:', length(vars2))
## Number of features used in tmodel2 is: 12
cat('Number of features used in tmodel3 is:', length(vars3))
## Number of features used in tmodel3 is: 9
This section evaluates the performance of a Decision Tree classifier using three different feature sets introduced earlier.
The following functions are set up for model implementation and evaluation.
panderOpt <- function(){
library(pander)
panderOptions("plain.ascii", TRUE)
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple")
}
performanceMeasures <- function(ytrue, ypred, model.name = "model", threshold=0.5) {
auc <- calcAUC(ypred, ytrue)
dev.norm <- -2 * logLikelihood(ytrue, ypred)/length(ypred)
cmat <- table(actual = ytrue, predicted = ypred >= threshold)
accuracy <- sum(diag(cmat)) / sum(cmat)
precision <- cmat[2, 2] / sum(cmat[, 2])
recall <- cmat[2, 2] / sum(cmat[2, ])
f1 <- 2 * precision * recall / (precision + recall)
list(perf = data.frame(model = model.name, auc = auc, precision = precision,
recall = recall, f1 = f1, dev.norm = dev.norm),
confusion_matrix = cmat)
}
pretty_perf_table <- function(model, xtrain, ytrain,
xcalib, ycalib, xtest, ytest, threshold=0.5) {
panderOpt()
perf_justify <- "lrrrrr"
pred_train <- predict(model, newdata=xtrain)
pred_calib <- predict(model, newdata=xcalib)
pred_test <- predict(model, newdata=xtest)
trainperf_df <- performanceMeasures(
ytrain, pred_train, model.name="training", threshold=threshold)
calibperf_df <- performanceMeasures(
ycalib, pred_calib, model.name="calibration", threshold=threshold)
testperf_df <- performanceMeasures(
ytest, pred_test, model.name="test", threshold=threshold)
perftable <- rbind(trainperf_df$perf, calibperf_df$perf, testperf_df$perf)
pandoc.table(perftable, justify = perf_justify)
}
pretty_test <- function(model, xtest, ytest, threshold=0.5) {
panderOpt()
perf_justify <- "lrrrrr"
pred_test <- predict(model, newdata=xtest)
testperf <- performanceMeasures(ytest, pred_test, model.name="test", threshold=threshold)
pandoc.table(testperf$perf, justify = perf_justify)
}
pretty_cmat <- function(model, xtest, ytest, threshold=0.5) {
pred_test <- predict(model, newdata=xtest)
testperf <- performanceMeasures(ytest, pred_test, model.name="test", threshold=threshold)
pandoc.table(testperf$confusion_matrix, caption = "Confusion Matrix for Test Set")
}
# PLOTS
# single model ROC
plot_roc_model <- function(model, dTest, dTrain, dCal, outcome, pos){
pred_test_roc <- predict(model, newdata = dTest)
pred_train_roc <- predict(model, newdata = dTrain)
pred_cal_roc <- predict(model, newdata = dCal)
roc_1 <- rocit(score = pred_test_roc, class = dTest[[outcome]] == pos)
roc_2 <- rocit(score = pred_train_roc, class = dTrain[[outcome]] == pos)
roc_3 <- rocit(score = pred_cal_roc, class = dCal[[outcome]] == pos)
cols = brewer.pal(n = 3, name = "Dark2")
plot(roc_1, col = c(cols[1],"black"), lwd = 3, legend = FALSE, YIndex = FALSE, values = TRUE, asp = 1)
lines(roc_2$TPR ~ roc_2$FPR, lwd = 3, col = c(cols[2],"black"), asp = 1)
lines(roc_3$TPR ~ roc_3$FPR, lwd = 3, col = c(cols[3],"black"), asp = 1)
legend("bottomright", col = c(cols, "black"),
legend = c("Test Data", "Training Data", "Calibration Data", "Null Model"), lwd = 2)
title("ROC for Model on the Test, Training and Calibration Data")
}
# multiple models ROC
plot_roc_new <- function(ypred, ytrue, legend.text="", ...) {
cols = brewer.pal(n = 4, name = "Dark2")
colour_id <- 1
roc <- rocit(score=ypred, class=ytrue)
plot(roc, col = c(cols[colour_id], "black"), lwd = 3,
legend = FALSE, YIndex = FALSE, values = TRUE, asp = 1,
cex.lab=3, cex.axis=2, cex.main=3)
# Process additional models
additional_args <- list(...)
num_args <- length(additional_args)
for (i in seq(1, num_args, by = 2)) {
colour_id <- colour_id + 1
ypredi <- additional_args[[i]]
ytruei <- additional_args[[i + 1]]
roc <- rocit(score=ypredi, class=ytruei)
lines(roc$TPR ~ roc$FPR, lwd = 3,
col = c(cols[colour_id], "black"), asp = 1,
cex.lab=3, cex.axis=2, cex.main=3)
}
if (!any(legend.text == "")) {
legend("bottomright", col = c(cols[1:colour_id],"black"),
c(legend.text, "Null model"), lwd = 2)
}
}
tmodel <- rpart(as.formula(paste("Death_Rate_Cat >0 ~", paste(vars1, collapse = " + "))), data = dTrain)
rpart.plot(tmodel)
# summary(tmodel)
pretty_perf_table(tmodel, dTrain[vars1], dTrain[,outcome]==pos,
dCal[vars1], dCal[,outcome]==pos,
dTest[vars1], dTest[,outcome]==pos)
##
##
## model auc precision recall f1 dev.norm
## ------------- -------- ----------- -------- -------- ----------
## training 0.9433 0.8939 0.9071 0.9004 12.16
## calibration 0.9289 0.8761 0.8613 0.8686 13.60
## test 0.9410 0.8942 0.8885 0.8914 11.99
plot_roc_model(tmodel, dTest, dTrain, dCal, outcome, pos)
Feature Set 1’s Decision Tree performs well with high AUC, precision, and recall across training, calibration, and test sets, indicating good generalization and model stability.
tmodel2 <- rpart(as.formula(paste("Death_Rate_Cat >0 ~", paste(vars2, collapse = " + "))), data = dTrain)
rpart.plot(tmodel2)
pretty_perf_table(tmodel2, dTrain[vars2], dTrain[,outcome]==pos,
dCal[vars2], dCal[,outcome]==pos,
dTest[vars2], dTest[,outcome]==pos)
##
##
## model auc precision recall f1 dev.norm
## ------------- -------- ----------- -------- -------- ----------
## training 0.8997 0.8766 0.8002 0.8367 13.74
## calibration 0.8836 0.8732 0.7521 0.8081 14.69
## test 0.8972 0.8938 0.7771 0.8313 13.68
plot_roc_model(tmodel2, dTest, dTrain, dCal, outcome, pos)
Feature set 2 had an AUC of 0.897 on the test set. The model performed well but was slightly less accurate compared to feature set 1.
tmodel3 <- rpart(as.formula(paste("Death_Rate_Cat >0 ~", paste(vars3, collapse = " + "))), data = dTrain)
rpart.plot(tmodel3)
pretty_perf_table(tmodel3, dTrain[vars3], dTrain[,outcome]==pos,
dCal[vars3], dCal[,outcome]==pos,
dTest[vars3], dTest[,outcome]==pos)
##
##
## model auc precision recall f1 dev.norm
## ------------- -------- ----------- -------- -------- ----------
## training 0.9325 0.9243 0.8434 0.8820 11.85
## calibration 0.9175 0.8688 0.8067 0.8366 12.78
## test 0.9353 0.9173 0.8121 0.8615 11.24
plot_roc_model(tmodel3, dTest, dTrain, dCal, outcome, pos)
Feature set 3 achieves a strong balance between predictive performance and model efficiency, using fewer variables while maintaining comparable AUC and precision to the full feature set.
pretty_test(tmodel, dTest[vars1], dTest[,outcome]==pos)
##
##
## model auc precision recall f1 dev.norm
## ------- ------- ----------- -------- -------- ----------
## test 0.941 0.8942 0.8885 0.8914 11.99
tmodel.pred <- predict(tmodel, newdata=dTest[vars1])
tmodel2.pred <- predict(tmodel2, newdata=dTest[vars2])
tmodel3.pred <- predict(tmodel3, newdata=dTest[vars3])
dTest.gt <- dTest[,outcome] == pos
plot_roc_new(legend.text=c("tmodel", "tmodel2", "tmodel3"),
tmodel.pred, dTest.gt, tmodel2.pred, dTest.gt,
tmodel3.pred, dTest.gt)
tmodel and tmodel3 perform better than tmodel2. tmodel3 is efficient, using only 9 variables compared with tmodel's 17, while achieving similar results.
print(tmodel$variable.importance)
## Vitamin.A.deficiency Unsafe.water
## 508.07520 433.95530
## Breastfeeding Child.growth
## 415.39806 399.36625
## Iron.deficiency Low.birth.weight
## 377.58096 334.93455
## Low.bone.mineral.density High.systolic.blood.pressure
## 136.13020 128.05937
## Dietary Tobacco
## 126.34042 106.40429
## Low.physical.activity High.fasting.plasma.glucose
## 98.83086 95.72644
## High.body.mass.index Air.pollution
## 90.05734 84.86580
## Unsafe.sex Alcohol.use
## 59.27832 57.82802
## Drug.use
## 17.09990
print(tmodel2$variable.importance)
## predVitamin.A.deficiency predBreastfeeding predUnsafe.water
## 499.717528 407.844293 380.479887
## predChild.growth predLow.birth.weight predIron.deficiency
## 374.287190 354.234563 337.754650
## predAir.pollution predAlcohol.use predUnsafe.sex
## 45.304444 45.158097 23.014513
## predTobacco predDietary predLow.physical.activity
## 17.571915 17.145594 5.920737
print(tmodel3$variable.importance)
## Low.birth.weight High.systolic.blood.pressure
## 394.6867 360.6013
## Air.pollution Low.bone.mineral.density
## 358.1667 330.2818
## Tobacco High.fasting.plasma.glucose
## 311.5248 302.7744
## Dietary Alcohol.use
## 245.2245 150.3903
## Drug.use
## 138.8012
The variable importance of the three feature sets tells a consistent story: nutritional and developmental factors (Vitamin.A.deficiency, Unsafe.water, Breastfeeding, Child.growth, Low.birth.weight) dominate tmodel and tmodel2, while tmodel3 relies most on Low.birth.weight, High.systolic.blood.pressure, Air.pollution, and Low.bone.mineral.density.
Another classifier, K-Nearest Neighbors (KNN), is used as a comparison against the decision tree models. The KNN classifier provides an alternative approach to predicting the death-rate category for each country.
Functions are set up below for model implementation and evaluation.
knnPredict <- function(df, knnTrain, knnCl, k) {
knnDecision <- knn(knnTrain, df, knnCl, k=k, prob=TRUE)
ifelse(knnDecision == TRUE,
attributes(knnDecision)$prob,
1 - attributes(knnDecision)$prob)
}
compare_perf_table <- function(model1, x, y, knnTrain, knnCl, nK, model1_name="Model 1", threshold=0.5) {
panderOpt()
perf_justify <- "lrrrrr"
pred1 <- predict(model1, newdata=x)
pred2 <- knnPredict(x, knnTrain, knnCl, nK)
perf1 <- performanceMeasures(y, pred1, model.name=model1_name, threshold=threshold)
perf2 <- performanceMeasures(y, pred2, model.name="KNN Model", threshold=threshold)
perftable <- rbind(perf1$perf, perf2$perf)
pandoc.table(perftable, justify = perf_justify, caption = "Performance Comparison for Two Models")
}
compare_cmat <- function(model1, x, y, knnTrain, knnCl, nK, model1_name="Model 1", threshold=0.5) {
panderOpt()
pred1 <- predict(model1, newdata=x)
pred2 <- knnPredict(x, knnTrain, knnCl, nK)
cmat1 <- performanceMeasures(y, pred1, model.name=model1_name, threshold=threshold)$confusion_matrix
cmat2 <- performanceMeasures(y, pred2, model.name="KNN Model", threshold=threshold)$confusion_matrix
pandoc.table(cmat1, caption = paste("Confusion Matrix for", model1_name))
pandoc.table(cmat2, caption = "Confusion Matrix for KNN Model")
}
To determine the best value of k for each feature set, the train() function from the caret package is used with cross-validation to test k over the range 1:10. Note: ChatGPT was used to understand the implementation of the train() function.
# feature sets1
dTrain_sub1 <- dTrain[, c(vars1, outcome)]
knnModel_1 <- train(
Death_Rate_Cat ~ .,
data = dTrain_sub1,
method = "knn",
trControl = trainControl(method = "cv"),
tuneGrid = data.frame(k = 1:10)
)
best_k1 <- knnModel_1$bestTune$k
# feature sets2
dTrain_sub2 <- dTrain[, c(vars2, outcome)]
knnModel_2 <- train(
Death_Rate_Cat ~ .,
data = dTrain_sub2,
method = "knn",
trControl = trainControl(method = "cv"),
tuneGrid = data.frame(k = 1:10)
)
best_k2 <- knnModel_2$bestTune$k
# feature sets3
dTrain_sub3 <- dTrain[, c(vars3, outcome)]
knnModel_3 <- train(
Death_Rate_Cat ~ .,
data = dTrain_sub3,
method = "knn",
trControl = trainControl(method = "cv"),
tuneGrid = data.frame(k = 1:10)
)
best_k3 <- knnModel_3$bestTune$k
print(paste("Best k value for vars1:", best_k1))
## [1] "Best k value for vars1: 1"
print(paste("Best k value for vars2", best_k2))
## [1] "Best k value for vars2 1"
print(paste("Best k value for vars3", best_k3))
## [1] "Best k value for vars3 1"
Cross-validation selected k = 1 for all three feature sets, which suggests that the feature space is highly separable. However, k = 1 is sensitive to noise and may lead to overfitting, so the KNN results are interpreted as a benchmark comparison rather than a production-ready model.
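As a point of comparison (a sketch, not part of the original pipeline), one way to hedge against the k = 1 result is to restrict the tuning grid to odd values of k >= 3 and use repeated cross-validation, which reduces sensitivity to noise and avoids ties in binary voting:

```r
# Sketch: re-tune KNN on feature set 1 with odd k >= 3 only.
# Assumes dTrain_sub1 from the chunk above; repeated CV gives a more
# stable accuracy estimate than a single CV split.
knnModel_odd <- train(
  Death_Rate_Cat ~ .,
  data = dTrain_sub1,
  method = "knn",
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
  tuneGrid = data.frame(k = seq(3, 15, by = 2))
)
knnModel_odd$bestTune$k
```

The analysis below nevertheless proceeds with the cross-validated k = 1 so that results remain directly comparable across feature sets.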
Use the best k = 1 to build three KNN models for three feature sets.
nK <- 1
knn1.pred <- knnPredict(dTest[vars1], dTrain[vars1], dTrain[, outcome] == pos, nK)
knn2.pred <- knnPredict(dTest[vars2], dTrain[vars2], dTrain[, outcome] == pos, nK)
knn3.pred <- knnPredict(dTest[vars3], dTrain[vars3], dTrain[, outcome] == pos, nK)
plot_roc_new(legend.text=c("knn1", "knn2", "knn3"),
knn1.pred, dTest.gt, knn2.pred, dTest.gt,
knn3.pred, dTest.gt)
The ROC plot above shows the performance of each model. All three KNN models show strong ROC performance on the test set. However, due to the use of k = 1, these results should be interpreted with caution.
compare_perf_table(tmodel, dTest[vars1], dTest[, outcome] == pos, dTrain[,vars1], dTrain[, outcome] == pos, nK)
##
##
## model auc precision recall f1 dev.norm
## ----------- -------- ----------- -------- -------- ----------
## Model 1 0.9410 0.8942 0.8885 0.8914 11.9942
## KNN Model 0.9807 0.9840 0.9777 0.9808 0.5331
##
## Table: Performance Comparison for Two Models
compare_cmat(tmodel, dTest[vars1], dTest[, outcome] == pos, dTrain[,vars1], dTrain[, outcome] == pos, nK)
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 275 33
## TRUE 35 279
##
## Table: Confusion Matrix for Model 1
##
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 303 5
## TRUE 7 307
##
## Table: Confusion Matrix for KNN Model
plot_roc_new(legend.text=c("tmodel1", "KNN Model1"),
tmodel.pred, dTest.gt,
knn1.pred, dTest.gt)
Feature Set 1: KNN achieves higher AUC, precision, recall, and F1-score than the Decision Tree on the test set. However, this performance gain should be interpreted with caution, as the KNN model uses k = 1 and is more sensitive to local noise.
compare_perf_table(tmodel2, dTest[vars2], dTest[, outcome] == pos, dTrain[,vars2], dTrain[, outcome] == pos, nK)
##
##
## model auc precision recall f1 dev.norm
## ----------- -------- ----------- -------- -------- ----------
## Model 1 0.8972 0.8938 0.7771 0.8313 13.682
## KNN Model 0.9673 0.9474 0.9745 0.9608 1.377
##
## Table: Performance Comparison for Two Models
compare_cmat(tmodel2, dTest[vars2], dTest[, outcome] == pos, dTrain[,vars2], dTrain[, outcome] == pos, nK)
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 279 29
## TRUE 70 244
##
## Table: Confusion Matrix for Model 1
##
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 291 17
## TRUE 8 306
##
## Table: Confusion Matrix for KNN Model
plot_roc_new(legend.text=c("tmodel2", "KNN Model2"),
tmodel2.pred, dTest.gt,
knn2.pred, dTest.gt)
Feature Set 2: KNN shows improved recall and F1-score compared to the Decision Tree, suggesting stronger sensitivity to positive cases. Nevertheless, the Decision Tree remains more interpretable and stable across feature reductions.
compare_perf_table(tmodel3, dTest[vars3], dTest[, outcome] == pos, dTrain[,vars3], dTrain[, outcome] == pos, nK)
##
##
## model auc precision recall f1 dev.norm
## ----------- -------- ----------- -------- -------- ----------
## Model 1 0.9353 0.9173 0.8121 0.8615 11.2390
## KNN Model 0.9824 0.9871 0.9777 0.9824 0.4886
##
## Table: Performance Comparison for Two Models
compare_cmat(tmodel3, dTest[vars3], dTest[, outcome] == pos, dTrain[,vars3], dTrain[, outcome] == pos, nK)
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 285 23
## TRUE 59 255
##
## Table: Confusion Matrix for Model 1
##
##
##
## FALSE TRUE
## ----------- ------- ------
## FALSE 304 4
## TRUE 7 307
##
## Table: Confusion Matrix for KNN Model
plot_roc_new(legend.text=c("tmodel3", "KNN Model3"),
tmodel3.pred, dTest.gt,
knn3.pred, dTest.gt)
Feature Set 3: KNN achieves the strongest overall ROC performance using a reduced feature set. This suggests that the selected features provide strong local separability. However, due to the non-parametric nature of KNN, the Decision Tree is preferred for interpretability and insight into feature importance.
Overall, KNN consistently outperforms Decision Tree models in terms of predictive metrics across all feature sets. However, due to its reliance on k = 1 and lack of interpretability, KNN is used primarily as a performance benchmark. Decision Trees provide a better balance between predictive performance and interpretability, making them more suitable for insight-driven analysis.
Local Interpretable Model-agnostic Explanations (LIME) are used to understand the predictions made by the decision tree model for feature set 3. Note: ChatGPT helped in adapting the rpart model to work with LIME.
model_type.rpart <- function(x, ...) {
return("classification")
}
predict_model.rpart <- function(x, newdata, type, ...) {
pred <- predict(x, newdata, type = "prob")
return(as.data.frame(pred))
}
dTrain_subset <- dTrain[, c(vars3, "Death_Rate_Cat"), drop = FALSE]
dTrain_subset$Death_Rate_Cat <- as.factor(dTrain_subset$Death_Rate_Cat)
tmodel.LIME <- rpart(as.formula(paste("Death_Rate_Cat ~", paste(vars3, collapse = " + "))), data = dTrain_subset)
newdata <- dTest[, vars3, drop = FALSE]
pred <- predict(tmodel.LIME, newdata, type = "prob")
explainer <- lime(dTrain_subset[, vars3], model = tmodel.LIME, bin_continuous = TRUE, n_bins = 10)
example.1 <- dTest[123, vars3, drop = FALSE]
explanation.1 <- lime::explain(example.1, explainer, n_labels = 1, n_features = 4)
plot_features(explanation.1)
In this example, features such as Low.birth.weight and Low.bone.mineral.density strongly support a high death rate prediction, resulting in a predicted probability of 0.95.
example.2 <- dTest[456, vars3, drop = FALSE]
explanation.2 <- lime::explain(example.2, explainer, n_labels = 1, n_features = 4)
plot_features(explanation.2)
In the second example, certain features support while others contradict the predicted class, leading to a lower confidence prediction (probability = 0.77).
The LIME explanations show that the selected features in feature set 3 have intuitive and consistent contributions to individual predictions made by tmodel3. For the examined samples, the model relies on meaningful risk factors such as birth weight and metabolic indicators, suggesting that the decision logic is locally interpretable and aligned with domain expectations.
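To check that these single-case explanations are not cherry-picked, the same explainer can be applied to a small random sample of test countries at once; lime's plot_explanations() then shows the feature contributions across cases as a heatmap (a sketch using the explainer defined above):

```r
# Sketch: explain several test cases in one call and view them together.
set.seed(4009)
sample_rows <- sample(nrow(dTest), 10)
explanation_all <- lime::explain(dTest[sample_rows, vars3, drop = FALSE],
                                 explainer, n_labels = 1, n_features = 4)
plot_explanations(explanation_all)
```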
Conclusion
This chapter evaluated classification models for predicting country-level death rate categories using Decision Trees and KNN across three different feature sets. KNN consistently achieved higher predictive performance, serving as a strong benchmark model. However, its non-parametric nature and limited interpretability reduce its suitability for insight-driven analysis.
Decision Trees provided competitive performance while offering clear feature importance and interpretable decision structures. The LIME analysis further demonstrated that the selected features in feature set 3 contribute meaningfully to individual predictions, supporting the use of Decision Trees as a transparent and reliable model for understanding mortality risk patterns.
Clustering is an unsupervised machine learning task used for exploratory data analysis. It groups observations into clusters based on similarity without predefined labels. In this section, two clustering methods, Hierarchical Clustering and k-means, are applied to analyze patterns across 17 aggregated risk factors for 199 countries. The objective is to identify meaningful country groupings and explore potential global health risk patterns.
For the clustering task, the dataset is prepared to meet the following requirements:
Data aggregation: Country-level aggregation is applied to transform the data into a single record per country, representing the cumulative burden of risk factors over time.
Normalization and scaling: Log transformation is applied to the 17 numerical risk factors to reduce skewness, followed by standardization to zero mean and unit variance. This ensures that all features contribute equally to distance-based clustering methods.
The data preparation steps and implementation are adapted from lecture materials.
# data aggregate
d1 <- subset(df_new_1, select = c(numericVars, "Country"))
df_sum <- aggregate(. ~ Country, data = d1, FUN = sum)
nrow(df_sum)
## [1] 199
# log transformation
df <- log_trans(df_sum, numericVars)
scaled_df <- scale(df[,numericVars])
# mean values of all columns
attr(scaled_df, "scaled:center")
## High.systolic.blood.pressure Alcohol.use
## 12.003964 10.342030
## Low.birth.weight Unsafe.sex
## 9.994653 9.540418
## Low.physical.activity High.fasting.plasma.glucose
## 9.298018 11.506639
## High.body.mass.index Drug.use
## 11.290082 8.371313
## Low.bone.mineral.density Vitamin.A.deficiency
## 8.433872 4.628914
## Iron.deficiency Breastfeeding
## 5.505415 7.123939
## Child.growth Tobacco
## 9.388071 11.624384
## Dietary Unsafe.water
## 11.334684 9.428502
## Air.pollution
## 11.371090
# standard deviations of all columns
attr(scaled_df, "scaled:scale")
## High.systolic.blood.pressure Alcohol.use
## 2.273391 2.478146
## Low.birth.weight Unsafe.sex
## 2.832144 2.620205
## Low.physical.activity High.fasting.plasma.glucose
## 2.259595 2.146024
## High.body.mass.index Drug.use
## 2.142521 2.426214
## Low.bone.mineral.density Vitamin.A.deficiency
## 2.428635 4.227074
## Iron.deficiency Breastfeeding
## 3.369143 3.692707
## Child.growth Tobacco
## 3.388698 2.331520
## Dietary Unsafe.water
## 2.297216 3.497627
## Air.pollution
## 2.555058
Calculate Distance
Use dist() to calculate the pairwise Euclidean distances between the 199 observations, giving 199 choose 2 = 19,701 unique pairs.
dist <- dist(scaled_df, method="euclidean")
Use Ward’s method with hclust(), which minimizes the variance within each cluster.
pfit <- hclust(dist, method="ward.D2")
summary(pfit)
## Length Class Mode
## merge 396 -none- numeric
## height 198 -none- numeric
## order 199 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
Visualization: dendrogram
Plot the dendrogram to represent the nested clusters. Start with k=5 for preliminary analysis. To maintain readability, only 20% of the countries are labeled.
set.seed(4009)
selected_labels <- sample(df$Country, size = length(df$Country) * 0.2)
labels_sub <- ifelse(df$Country %in% selected_labels, df$Country, "")
plot(pfit, labels=labels_sub, main="Cluster Dendrogram for Death Data", cex = 0.4)
rect.hclust(pfit, k=5) # 5 clusters
Extract members of each cluster using
cutree()
print_clusters <- function(df, groups, cols_to_print) {
Ngroups <- max(groups)
for (i in 1:Ngroups) {
print(paste("cluster", i))
print(df[groups == i, cols_to_print])
}
}
print_clusters_countries <- function(df, groups) {
Ngroups <- max(groups)
for (i in 1:Ngroups) {
countries_in_cluster <- df[groups == i, "Country"]
countries_str <- paste(countries_in_cluster, collapse = ", ")
print(paste("Cluster", i, ": ", countries_str))
}
}
group.5 <- cutree(pfit, k=5)
# print_clusters(df, group.5, c(numericVars, "Country"))
# print_clusters_countries(df, group.5)
Principal Component Analysis (PCA) is used to transform the high-dimensional data into two dimensions for visualization.
# calculate the principal components
princ <- prcomp(scaled_df)
# first two principal components
nComp <- 2
# project scaled_df onto the first 2 principal components to form a 2-column data frame
project2D <- as.data.frame(predict(princ, newdata=scaled_df)[,1:nComp])
Combine the project2D with clustering results
hclust.project2D <- cbind(project2D, cluster=as.factor(group.5), Country=df$Country)
head(hclust.project2D)
## PC1 PC2 cluster Country
## 1 -1.1917285 -0.20571289 1 Afghanistan
## 2 0.3549030 0.04446568 2 Albania
## 3 -1.1372186 0.27198225 1 Algeria
## 4 2.8807463 -0.54158050 3 American Samoa
## 5 2.9379324 -0.16611355 3 Andorra
## 6 -0.9905809 -0.53790896 4 Angola
Visualizing Clusters
# finding convex hull
find_convex_hull <- function(proj2Ddf, groups) {
do.call(rbind,
lapply(unique(groups),
FUN = function(c) {
f <- subset(proj2Ddf, cluster==c);
f[chull(f),]
}
)
)
}
hclust.hull <- find_convex_hull(hclust.project2D, group.5)
cols <- brewer.pal(n = 5, name = "Dark2")
plot_theme <- function(){
theme_minimal() +
theme(
text = element_text(size = 12),
plot.title = element_text(hjust = 0.5, face = "bold", margin = margin(b = 25)),
panel.border = element_rect(color = "black", fill = NA, size = 0.75),
panel.grid.major = element_line(color = "grey", linetype = "dashed", size = 0.2),
plot.margin = margin(t = 20, r = 20, b = 15, l = 20),
axis.ticks = element_line(color = "black")
)
}
ggplot(hclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label='', color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.hull, aes(group=cluster, fill=as.factor(cluster)),
alpha=0.4, linetype=0) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
ggtitle("Hierarchical Clustering with k = 5") +
plot_theme()
clusterboot() is used to assess how stable the clustering is under bootstrap resampling.
kbest.p <- 5
cboot.hclust <- clusterboot(scaled_df, clustermethod=hclustCBI,
method="ward.D2", k=kbest.p)
## boot 1
## boot 2
## ... (progress output for 100 bootstrap resamples, truncated)
## boot 100
summary(cboot.hclust$result)
## Length Class Mode
## result 7 hclust list
## noise 1 -none- logical
## nc 1 -none- numeric
## clusterlist 5 -none- list
## partition 199 -none- numeric
## clustermethod 1 -none- character
## nccl 1 -none- numeric
groups.cboot <- cboot.hclust$result$partition
# print_clusters_countries(df, groups.cboot)
From the results, cluster 3 and cluster 4 show high stability.
# cboot.hclust$bootbrd = number of times a cluster is dissolved.
(values <- 1 - cboot.hclust$bootbrd/100) # large values here => highly stable
## [1] 0.67 0.65 0.97 0.90 0.77
cat("So clusters", order(values)[5], "and", order(values)[4], "are highly stable")
## So clusters 3 and 4 are highly stable
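The dissolution counts above can be cross-checked against the mean Jaccard similarity that clusterboot() stores in $bootmean; by convention, values above roughly 0.85 indicate a highly stable cluster, 0.6-0.75 a pattern, and below 0.6 a dissolved cluster (a sketch on the cboot.hclust object above):

```r
# Sketch: mean Jaccard similarity per cluster across the 100 bootstrap runs.
round(cboot.hclust$bootmean, 2)
```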
The functions below are used to find the best k value for both hierarchical clustering and k-means.
sqr_euDist <- function(x, y) {
sum((x - y)^2)
}
wss <- function(clustermat) {
c0 <- colMeans(clustermat)
sum(apply( clustermat, 1, FUN=function(row) {sqr_euDist(row, c0)} ))
}
wss_total <- function(scaled_df, labels) {
wss.sum <- 0
k <- length(unique(labels))
for (i in 1:k)
wss.sum <- wss.sum + wss(subset(scaled_df, labels == i))
wss.sum
}
tss <- function(scaled_df) {
wss(scaled_df)
}
CH_index <- function(scaled_df, kmax, method="kmeans") {
if (!(method %in% c("kmeans", "hclust")))
stop("method must be one of c('kmeans', 'hclust')")
npts <- nrow(scaled_df)
wss.value <- numeric(kmax)
wss.value[1] <- wss(scaled_df)
if (method == "kmeans") {
# kmeans
for (k in 2:kmax) {
clustering <- kmeans(scaled_df, k, nstart=10, iter.max=100)
wss.value[k] <- clustering$tot.withinss
}
} else {
# hclust
d <- dist(scaled_df, method="euclidean")
pfit <- hclust(d, method="ward.D2")
for (k in 2:kmax) {
labels <- cutree(pfit, k=k)
wss.value[k] <- wss_total(scaled_df, labels)
}
}
bss.value <- tss(scaled_df) - wss.value
B <- bss.value / (0:(kmax-1))
W <- wss.value / (npts - 1:kmax)
data.frame(k = 1:kmax, CH_index = B/W, WSS = wss.value)
}
Plot the CH index and WSS against k.
# calculate the CH criterion
crit.df <- CH_index(scaled_df, 10, method="hclust")
fig1 <- ggplot(crit.df, aes(x=k, y=CH_index)) +
geom_point() + geom_line(colour="#D95F02") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="CH index") + plot_theme()
fig2 <- ggplot(crit.df, aes(x=k, y=WSS), color="#1B9E77") +
geom_point() + geom_line(colour="#1B9E77") +
scale_x_continuous(breaks=1:10, labels=1:10) +
plot_theme()
grid.arrange(fig1, fig2, nrow=1)
The CH index suggests k=2 or k=3 for optimal clustering, while the WSS indicates k=3 as the best choice. Combining these, k=3 is recommended.
kbest.p <- 5
# run kmeans with 5 clusters, 100 random starts, and 100 maximum iterations per run.
kmClusters <- kmeans(scaled_df, kbest.p, nstart=100, iter.max=100)
Centres and sizes of the 5 Clusters
kmClusters$centers
## High.systolic.blood.pressure Alcohol.use Low.birth.weight Unsafe.sex
## 1 -0.4706631 -0.5215121 -0.1852793 -0.3249209
## 2 -1.6036906 -1.5455085 -1.5118015 -1.4947068
## 3 1.4004303 1.2671184 0.9699850 0.9412140
## 4 0.4529530 0.4528758 -0.4027673 -0.1279156
## 5 0.2846426 0.3532153 0.8584614 0.7662272
## Low.physical.activity High.fasting.plasma.glucose High.body.mass.index
## 1 -0.4625711 -0.4514388 -0.4356600
## 2 -1.3859112 -1.5665752 -1.5480451
## 3 1.4235463 1.4128669 1.4198448
## 4 0.5600976 0.3742936 0.4641758
## 5 0.0521484 0.2958558 0.2069019
## Drug.use Low.bone.mineral.density Vitamin.A.deficiency Iron.deficiency
## 1 -0.5054319 -0.4747196 -0.06415194 -0.1173993
## 2 -1.5008970 -1.5789153 -0.91823220 -1.3271058
## 3 1.4223241 1.2874992 0.42164082 0.7756671
## 4 0.3947061 0.4228103 -0.92917814 -0.6490869
## 5 0.2709901 0.3529593 1.05505100 0.9693099
## Breastfeeding Child.growth Tobacco Dietary Unsafe.water Air.pollution
## 1 -0.0699773 -0.1780419 -0.5065301 -0.4720642 -0.1496518 -0.35373890
## 2 -1.4394705 -1.3751612 -1.5411169 -1.5975280 -1.3590615 -1.66778539
## 3 0.8005118 0.8479230 1.4291260 1.3850073 0.7716393 1.21251305
## 4 -0.5412298 -0.5613945 0.5152524 0.4580107 -0.5645836 0.07853893
## 5 0.9209407 0.9415731 0.2099951 0.2863717 0.9550458 0.60792754
kmClusters$size
## [1] 38 36 30 39 56
cat("Total of cluster sizes =", sum(kmClusters$size))
## Total of cluster sizes = 199
cat("Total number of observations =", nrow(df))
## Total number of observations = 199
Extracted groups
groups <- kmClusters$cluster
# print_clusters_countries(df, groups)
kmClustering.ch <- kmeansruns(scaled_df, krange=1:10, criterion="ch")
kmClustering.ch$bestk
## [1] 2
kmClustering.asw <- kmeansruns(scaled_df, krange=1:10, criterion="asw")
kmClustering.asw$bestk
## [1] 2
# Compare the CH values for kmeans() and hclust().
print("CH index from kmeans for k=1 to 10:")
## [1] "CH index from kmeans for k=1 to 10:"
print(kmClustering.ch$crit)
## [1] 0.0000 217.7845 203.1375 197.4558 192.3847 184.0460 175.3755 171.8113
## [9] 166.8118 165.2752
print("CH index from hclust for k=1 to 10:")
## [1] "CH index from hclust for k=1 to 10:"
hclusting <- CH_index(scaled_df, 10, method="hclust")
print(hclusting$CH_index)
## [1] NaN 194.1658 188.5019 175.4456 175.4816 161.5447 153.6342 152.0306
## [9] 149.5742 150.5552
Code to plot CH and ASW against k
library(gridExtra)
kmCritframe <- data.frame(k=1:10, ch=kmClustering.ch$crit,
asw=kmClustering.asw$crit)
fig1 <- ggplot(kmCritframe, aes(x=k, y=ch)) +
geom_point() + geom_line(colour="#D95F02") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="CH index") + theme(text=element_text(size=20)) +
plot_theme()
fig2 <- ggplot(kmCritframe, aes(x=k, y=asw)) +
geom_point() + geom_line(colour="#1B9E77") +
scale_x_continuous(breaks=1:10, labels=1:10) +
labs(y="ASW") + theme(text=element_text(size=20)) +
plot_theme()
grid.arrange(fig1, fig2, nrow=1)
fig <- c()
kvalues <- seq(2,5)
for (k in kvalues) {
groups <- kmeans(scaled_df, k, nstart=100, iter.max=100)$cluster
kmclust.project2D <- cbind(project2D, cluster=as.factor(groups),
Country=df$Country)
kmclust.hull <- find_convex_hull(kmclust.project2D, groups)
assign(paste0("fig", k),
ggplot(kmclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_polygon(data=kmclust.hull, aes(group=cluster, fill=cluster),
alpha=0.4, linetype=0) +
labs(title = sprintf("k = %d", k)) + plot_theme()
)
}
Visualisation of clusterings from kmeans
grid.arrange(fig2, fig3, fig4, fig5, nrow=2)
PC1 captures a combination of metabolic, environmental, and behavioral risk factors, reflecting broad public health pressures.
PC2 is primarily driven by nutritional and developmental risks, highlighting child and maternal health dimensions.
summary(princ)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.667 1.6653 0.54223 0.2972 0.2798 0.27220 0.22454
## Proportion of Variance 0.791 0.1631 0.01729 0.0052 0.0046 0.00436 0.00297
## Cumulative Proportion 0.791 0.9541 0.97139 0.9766 0.9812 0.98555 0.98852
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.2100 0.18294 0.16064 0.15795 0.13253 0.12582 0.10828
## Proportion of Variance 0.0026 0.00197 0.00152 0.00147 0.00103 0.00093 0.00069
## Cumulative Proportion 0.9911 0.99308 0.99460 0.99606 0.99710 0.99803 0.99872
## PC15 PC16 PC17
## Standard deviation 0.10281 0.09079 0.05451
## Proportion of Variance 0.00062 0.00048 0.00017
## Cumulative Proportion 0.99934 0.99983 1.00000
loadings <- princ$rotation
print(loadings[, 1:2])
## PC1 PC2
## High.systolic.blood.pressure -0.2577389 0.187375477
## Alcohol.use -0.2490210 0.150255220
## Low.birth.weight -0.2507422 -0.212727319
## Unsafe.sex -0.2435665 -0.122809984
## Low.physical.activity -0.2300385 0.285511401
## High.fasting.plasma.glucose -0.2576155 0.176870560
## High.body.mass.index -0.2501258 0.219190967
## Drug.use -0.2504122 0.178259910
## Low.bone.mineral.density -0.2542712 0.166348352
## Vitamin.A.deficiency -0.1674627 -0.452697670
## Iron.deficiency -0.2262447 -0.314747905
## Breastfeeding -0.2318460 -0.297309452
## Child.growth -0.2336730 -0.294927651
## Tobacco -0.2498326 0.222219094
## Dietary -0.2564360 0.185745613
## Unsafe.water -0.2284604 -0.313632733
## Air.pollution -0.2687014 -0.007790184
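The interpretation of PC1 and PC2 above can be checked directly by ranking the absolute loadings (a sketch on the loadings matrix just printed):

```r
# Sketch: top contributors to each principal component by absolute loading.
top_loadings <- function(rot, pc, n = 5) {
  head(sort(abs(rot[, pc]), decreasing = TRUE), n)
}
top_loadings(loadings, "PC1")  # broad, near-uniform loadings across risks
top_loadings(loadings, "PC2")  # dominated by nutritional/developmental risks
```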
Using the best k = 3
group.3 <- cutree(pfit, k=3)
hclust.project2D <- cbind(project2D, cluster=as.factor(group.3), Country=df$Country)
hclust.hull <- find_convex_hull(hclust.project2D, group.3)
cols <- brewer.pal(n = 3, name = "Dark2")
ggplot(hclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_text(aes(label='', color=cluster), hjust=0, vjust=1, size=3) +
geom_polygon(data=hclust.hull, aes(group=cluster, fill=as.factor(cluster)),
alpha=0.4, linetype=0) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
ggtitle("Hierarchical Clustering with k = 3") +
plot_theme()
# print_clusters(df, group.3, c(numericVars, "Country"))
# print_clusters_countries(df, group.3)
Merge the cluster assignments with the death data and count the number of high death-rate years per cluster.
merge_k2 <- merge(d, hclust.project2D, by = "Country")
death_k2 <- aggregate(Death_Rate_Cat ~ cluster, data = merge_k2, FUN = function(x) sum(x == '1'))
death_k2 %>% kable()
| cluster | Death_Rate_Cat |
|---|---|
| 1 | 1958 |
| 2 | 763 |
| 3 | 264 |
Interpretation
Cluster 1 (green): This cluster has the most high death-rate years (1,958), indicating major health challenges. It includes countries such as China, India, and the USA, which balance environmental, behavioral, and metabolic risks but face some nutritional challenges.
Cluster 2 (orange): With 763 high death-rate years, this cluster manages health better than Cluster 1. It includes high-income nations such as Australia and Canada, which manage lifestyle diseases well but still face child and maternal health issues.
Cluster 3 (purple): The lowest count of high death-rate years (264) shows that these countries have minimal health risks and effective management. It includes smaller countries such as Luxembourg and the Maldives, with strong developmental health.
Using k = 2
kmClusters.2 <- kmeans(scaled_df, 2, nstart=100, iter.max=100)
groups.2 <- kmClusters.2$cluster
kmClusters.2$centers
## High.systolic.blood.pressure Alcohol.use Low.birth.weight Unsafe.sex
## 1 0.4925056 0.4900929 0.4536706 0.4633381
## 2 -1.2576481 -1.2514872 -1.1584802 -1.1831670
## Low.physical.activity High.fasting.plasma.glucose High.body.mass.index
## 1 0.431274 0.4802037 0.4776813
## 2 -1.101289 -1.2262344 -1.2197933
## Drug.use Low.bone.mineral.density Vitamin.A.deficiency Iron.deficiency
## 1 0.4697689 0.4808924 0.2863152 0.4014351
## 2 -1.1995885 -1.2279932 -0.7311264 -1.0250931
## Breastfeeding Child.growth Tobacco Dietary Unsafe.water Air.pollution
## 1 0.4230363 0.4214129 0.4762984 0.4880479 0.4096835 0.4953506
## 2 -1.0802535 -1.0761080 -1.2162620 -1.2462652 -1.0461560 -1.2649132
kmClusters.2$size
## [1] 143 56
cat("Total of cluster sizes =", sum(kmClusters.2$size))
## Total of cluster sizes = 199
cat("Total number of observations =", nrow(df))
## Total number of observations = 199
kmclust.project2D <- cbind(project2D, cluster=as.factor(groups.2),Country=df$Country)
kmclust.hull <- find_convex_hull(kmclust.project2D, groups.2)
ggplot(kmclust.project2D, aes(x=PC1, y=PC2)) +
geom_point(aes(shape=cluster, color=cluster)) +
geom_polygon(data=kmclust.hull, aes(group=cluster, fill=cluster),
alpha=0.4, linetype=0) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols) +
ggtitle("kMeans Clustering with k = 2") +
plot_theme()
Cluster 1 (higher risks): Countries in this cluster show higher values across multiple health risks, representing a widespread challenge. It contains 143 countries, indicating that this pattern of elevated risk is common.
Cluster 2 (lower risks): Countries in this cluster show lower health-risk levels, indicating better health management. It includes only 56 countries, suggesting these health advantages are less common.
Overall, hierarchical clustering revealed more granular health risk patterns across countries, while k-means clustering provided a clearer separation between high-risk and low-risk groups. The consistency between clustering results and death-rate classifications further supports the validity of the identified clusters.
This chapter applied hierarchical clustering and k-means clustering
to identify groups of countries with similar health risk profiles.
By evaluating different values of k, the analysis demonstrated
how each clustering method reveals health patterns at different levels
of granularity.
Hierarchical clustering provided more detailed group structures,
while k-means clustering offered a clearer separation between high-risk
and low-risk countries.
These clustering results support data-driven insights that can inform
targeted public health strategies.
This project explored global health risks and mortality patterns
using exploratory data analysis and machine learning techniques.
EDA revealed key risk factors and global trends, highlighting major
contributors to death rates across countries.
Supervised learning models, including Decision Trees and k-Nearest
Neighbors (KNN), were used to classify countries into high and low
death-rate groups.
Among the tested feature sets, the reduced feature set performed
particularly well, with KNN consistently outperforming Decision
Trees.
LIME was applied to improve model interpretability by explaining
individual predictions.
Unsupervised clustering methods further identified groups of
countries with similar health risk profiles, complementing the
classification results and supporting targeted health planning.
Overall, this project demonstrates an end-to-end analytics and modelling
pipeline in R and shows how data-driven methods can support public
health insights and decision-making.
Lecture Materials: [Lecture materials used in Chapters 3 and 4 for building machine learning models, University Course Materials, 2024].
ChatGPT (GPT-4, OpenAI, 2024) was used to assist with understanding R functions and implementation details for selected machine learning techniques.
World Health Organization (WHO). Death Data: Yearly number of deaths in 228 entities from 1990 to 2019, covering various causes of death. Provided in Project Instruction.
World Bank. Population Data: Total population by country over years. Available at: https://data.worldbank.org/indicator/SP.POP.TOTL.
World Bank. GDP Data: GDP per capita (current US$). Available at: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD.
Waring, E., Quinn, M., McNamara, A., Arino de la Rubia, E., Zhu, H., Lowndes, J., Ellis, S., & Stewart, H. M. (2022). Skimr: Compact and Flexible Summaries of Data (R package version 2.1.5). https://doi.org/10.32614/CRAN.package.skimr. Retrieved from https://CRAN.R-project.org/package=skimr
Wickham, H. (2020). tidyr: Tidy Messy Data. R package version 1.1.2. Available at: https://tidyr.tidyverse.org/reference/pivot_longer.html.
Arel-Bundock, V., Enevoldsen, N., Yetman, C. (2023). countrycode: Convert Country Names and Country Codes (R package version 1.4.0). Available at: https://cran.r-project.org/web/packages/countrycode/index.html.
IHME, Global Burden of Disease (2024) – with minor processing by Our World in Data. “Alcohol use” [dataset]. IHME, Global Burden of Disease, “Global Burden of Disease - Risk Factors” [original data].